Courier Unicode Library

This library implements several algorithms related to the Unicode Standard, featuring:

Both C and C++11 bindings, with a complete set of manual pages.
The library has all Unicode mappings compiled in as fast, compact lookup tables. The library does not need to load the Unicode database files every time at startup.
The library implements lookups of uppercase, lowercase, and titlecase equivalents of a unicode character; grapheme and word breaking rules; line breaking rules; and the bi-directional algorithm.
The library implements canonical and compatibility decomposition and composition of Unicode text; and the Unicode script property.
The library also implements ancillary functions, like looking up the unicode character that corresponds to some HTML 4.0 entity (such as &amp;, for example), and determining the normal width or a double-width status of a unicode character. Also, an adaptation of the iconv(3) API for this unicode library.
Current status The current release of the Courier Unicode library is based on the Unicode 13.0.0 standard.
Installation and usage Download the current version of the library from https://www.courier-mta.org/download.html#unicode. Use the downloaded tarball to prepare an appropriate installation package for your operating system distribution. The typical sequence of commands is:
./configure    # Takes the default configure script options
make
make install DESTDIR=/tmp/courier-unicode-instimage    # For example.
The library uses a stock configure script, make, and a make install command that respects the DESTDIR setting to create an installation image in the directory specified by DESTDIR. make install does not take any explicit action to uninstall any older version of the library, or remove any files from an older version that do not exist any more in the new version. Use the created installation image to prepare an installable package in a native package format for your operating system distribution. Use your native system distribution's package manager to properly install and update this library.

To use the library, #include <courier-unicode.h> and link with -lcourier-unicode. The C++ compiler must have C++11 support. Minimum usable version of gcc appears to be gcc 4.4 with the -std=c++0x flag. Current versions of gcc use C++11, or higher, by default and do not require extra flags. For C++ code, as usual, the compiler and compilation flags for compiling any code that uses this library must be ABI-compatible with the ones used to build the library.

The Courier Unicode library installs an autoconf macro to probe for C++11 support. In your configure.ac:
AX_COURIER_UNICODE_VERSION
AX_COURIER_UNICODE_CXXFLAGS
AC_SUBST(COURIER_UNICODE_CXXFLAGS)
Then, in Makefile.am:
AM_CXXFLAGS = @COURIER_UNICODE_CXXFLAGS@
The AX_COURIER_UNICODE_VERSION macro checks the minimum library version, which defaults to the build version. An optional parameter explicitly specifies the minimum required version of the Courier Unicode library, for example:
AX_COURIER_UNICODE_VERSION(2.2.0)
AX_COURIER_UNICODE_CXXFLAGS sets COURIER_UNICODE_CXXFLAGS to the appropriate option for older gcc compilers that require an option to enable C++11 support.

The starting point for the library documentation is courier-unicode(7). Refer to the included manual pages, and the HTML version of the man pages, for more information.
Manual pages
C manual pages

courier-unicode(7)
Sam Varshavchik, Author, Courier Unicode Library

NAME
courier-unicode - Courier Unicode Library

SYNOPSIS
#include <courier-unicode.h>

DESCRIPTION
This library implements several algorithms related to the Unicode Standard. This library uses iconv(3) to convert text in a given character set to unicode. Any character set displayed by iconv --list can be specified for the corresponding character set parameter. Additionally, courier-unicode.h defines a special character string, unicode_x_imap_modutf7, that specifies the pseudo-character set for the modified-UTF7 encoding used in IMAP. This string can also be followed by a space and up to fifteen additional US-ASCII characters. The resulting character set also encodes these additional characters, in addition to unicode characters, with modified-UTF7.

The C++ compiler must have C++11 support. Minimum usable version of gcc appears to be gcc 4.4 with the -std=c++0x flag. Current versions of gcc use C++11, or higher, by default and do not require extra flags. Consult the packaging documentation for the Courier Unicode Library for information on any compiler flags that are needed to build software that links with this library.

SEE ALSO
unicode_bidi(3), unicode_canonical(3), unicode_category_lookup(3), unicode_convert(3), unicode_default_chset(3), unicode_emoji_lookup(3), unicode_html40ent_lookup(3), unicode_grapheme_break(3), unicode_line_break(3), unicode_script(3), unicode_uc(3), unicode_word_break(3), unicode::bidi(3), unicode::canonical(3), unicode::iconvert::convert(3), unicode::iconvert::convert_tocase(3), unicode::iconvert::fromu(3), unicode::iconvert::tou(3), unicode::tolower(3), unicode::linebreak(3), unicode::wordbreak(3).
unicode_bidi(3)

NAME
unicode_bidi, unicode_bidi_calc_levels, unicode_bidi_calc_types, unicode_bidi_calc, unicode_bidi_reorder, unicode_bidi_cleanup, unicode_bidi_cleaned_size, unicode_bidi_logical_order, unicode_bidi_combinings, unicode_bidi_needs_embed, unicode_bidi_embed, unicode_bidi_embed_paragraph_level, unicode_bidi_direction, unicode_bidi_type, unicode_bidi_setbnl, unicode_bidi_mirror, unicode_bidi_bracket_type - unicode bi-directional algorithm

SYNOPSIS
#include <courier-unicode.h>

unicode_bidi_level_t lr=UNICODE_BIDI_LR;

void unicode_bidi_calc_types(const char32_t *p, size_t n, unicode_bidi_type_t *types);

struct unicode_bidi_direction unicode_bidi_calc_levels(const char32_t *p, const unicode_bidi_type_t *types, size_t n, unicode_bidi_level_t *levels, const unicode_bidi_level_t *initial_embedding_level);

struct unicode_bidi_direction unicode_bidi_calc(const char32_t *p, size_t n, unicode_bidi_level_t *levels, const unicode_bidi_level_t *initial_embedding_level);

void unicode_bidi_reorder(char32_t *string, unicode_bidi_level_t *levels, size_t n, void (*reorder_callback)(size_t, size_t, void *), void *arg);

size_t unicode_bidi_cleanup(char32_t *string, unicode_bidi_level_t *levels, size_t n, int options, void (*removed_callback)(size_t, size_t, void *), void *arg);

size_t unicode_bidi_cleaned_size(const char32_t *string, size_t n, int options);

void unicode_bidi_logical_order(char32_t *string, unicode_bidi_level_t *levels, size_t n, unicode_bidi_level_t paragraph_embedding, void (*reorder_callback)(size_t index, size_t n, void *arg), void *arg);

void unicode_bidi_combinings(const char32_t *string, const unicode_bidi_level_t *levels, size_t n, void (*combinings)(unicode_bidi_level_t level, size_t level_start, size_t n_chars, size_t comb_start, size_t n_comb_chars, void *arg), void *arg);

int unicode_bidi_needs_embed(const char32_t *string, const unicode_bidi_level_t *levels, size_t n, const unicode_bidi_level_t *paragraph_embedding);

size_t unicode_bidi_embed(const char32_t *string, const unicode_bidi_level_t *levels, size_t n, unicode_bidi_level_t paragraph_embedding, void (*emit)(const char32_t *string, size_t n, int is_part_of_string, void *arg), void *arg);

char32_t unicode_bidi_embed_paragraph_level(const char32_t *string, size_t n, unicode_bidi_level_t paragraph_embedding);

char32_t unicode_bidi_mirror(char32_t c);

char32_t unicode_bidi_bracket_type(char32_t c, unicode_bracket_type_t *ret);

struct unicode_bidi_direction unicode_bidi_get_direction(char32_t *c, size_t n);

enum_bidi_type_t unicode_bidi_type(char32_t c);

void unicode_bidi_setbnl(char32_t *p, const unicode_bidi_type_t *types, size_t n);

DESCRIPTION
These functions are related to the Unicode Bi-Directional algorithm. They implement the algorithm up to and including step L2, and additionally return miscellaneous bi-directional-related metadata of Unicode characters. There's also a basic algorithm that reverses the bi-directional algorithm and produces a Unicode string with bi-directional markers that results in the same bi-directional string after reapplying the algorithm.

Calculating bi-directional rendering order

The following process computes the rendering order of characters according to the Unicode Bi-Directional algorithm:

Allocate an array of unicode_bidi_type_t that's the same size as the Unicode string.
Allocate an array of unicode_bidi_level_t that's the same size as the Unicode string.
Use unicode_bidi_calc_types() to compute the bi-directional types of the Unicode string's characters, and populate the unicode_bidi_type_t buffer.
Use unicode_bidi_calc_levels() to compute the bi-directional embedding level of the Unicode string's characters (this executes the Bi-Directional algorithm up to and including step L1). This populates the unicode_bidi_level_t buffer.
Alternatively: allocate only the unicode_bidi_level_t array and use unicode_bidi_calc(), which malloc()s the unicode_bidi_type_t buffer, calls unicode_bidi_calc_levels(), and then free()s the buffer.
Use unicode_bidi_reorder() to reverse any characters in the string, according to the algorithm (step L2), with an optional callback that reports which ranges of characters get reversed.
Use unicode_bidi_cleanup() to remove the characters from the string which are used by the bi-directional algorithm, and are not needed for rendering the text. unicode_bidi_cleaned_size() is available to determine, in advance, how many characters will remain.

The parameters to unicode_bidi_calc_types() are:

A pointer to the Unicode string.
Number of characters in the Unicode string.
A pointer to an array of unicode_bidi_type_t values. The caller is responsible for allocating and deallocating this array, which has the same size as the Unicode string.

The parameters to unicode_bidi_calc_levels() are:

A pointer to the Unicode string.
A pointer to the buffer that was passed to unicode_bidi_calc_types().
Number of characters in the Unicode string and the unicode_bidi_type_t buffer.
A pointer to an array of unicode_bidi_level_t values. The caller is responsible for allocating and deallocating this array, which has the same size as the Unicode string.
An optional pointer to a UNICODE_BIDI_LR or UNICODE_BIDI_RL value. This sets the default paragraph direction level. A null pointer computes the default paragraph direction level based on the string, as specified by the "P" rules of the bi-directional algorithm.

The parameters to unicode_bidi_calc() are the same except for the unicode_bidi_type_t pointer. unicode_bidi_calc() allocates this buffer by itself, calls unicode_bidi_calc_types(), and destroys the buffer before returning.

unicode_bidi_calc() and unicode_bidi_calc_levels() fill in the unicode_bidi_level_t array with the values corresponding to the embedding level of the corresponding character, according to the Unicode Bidirectional Algorithm (even values for left-to-right ordering, and odd values for right-to-left ordering).
A value of UNICODE_BIDI_SKIP designates directional markers (from step X9).

unicode_bidi_calc() and unicode_bidi_calc_levels() return the resolved paragraph direction level, which always matches the passed-in level, if one was specified; otherwise they report the derived one. These functions return a unicode_bidi_direction structure:

struct unicode_bidi_direction {
    unicode_bidi_level_t direction;
    int is_explicit;
};

direction gives the paragraph embedding level, UNICODE_BIDI_LR or UNICODE_BIDI_RL. is_explicit indicates whether the optional pointer to a UNICODE_BIDI_LR or UNICODE_BIDI_RL value was specified (and returned in direction), or whether the direction comes from a character with an explicit direction indication.

unicode_bidi_reorder() takes the actual unicode string together with the embedding values from unicode_bidi_calc() or unicode_bidi_calc_levels(), then reverses the bi-directional string, as specified by step L2 of the bi-directional algorithm. The parameters to unicode_bidi_reorder() are:

A pointer to the Unicode string.
A pointer to an array of unicode_bidi_level_t values.
Number of characters in the Unicode string and the unicode_bidi_level_t array.
An optional reorder_callback function pointer.

A non-NULL reorder_callback gets invoked to report each reversed character range. The callback's first parameter is the index of the first reversed character; the second parameter is the number of reversed characters, starting at the given index of the Unicode string. The third parameter is the arg passthrough parameter.

unicode_bidi_reorder() modifies its string and levels. reorder_callback gets invoked after reversing each consecutive range of values in the string and levels buffers. For example: reorder_callback(5, 7, arg) reports that character indexes #5 through #11 got reversed.

A NULL string pointer leaves the levels buffer unchanged, but still invokes the reorder_callback as if the character string, and their embedding values, were reversed.
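The callback arithmetic above (index plus count) can be sketched without the library. This is only an illustration: reverse_range_sketch() is a hypothetical stand-in for a single reversal step, not a library function, and struct reversed_ranges and record_range() are made-up names. Only the callback shape (size_t, size_t, void *) comes from the synopsis.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical bookkeeping structure, not part of the library. */
struct reversed_ranges {
    size_t count;      /* number of ranges recorded so far */
    size_t first[16];  /* index of the first character of each range */
    size_t last[16];   /* index of the last character of each range */
};

/* Same shape as reorder_callback: (index, number of characters, arg). */
static void record_range(size_t index, size_t n, void *arg)
{
    struct reversed_ranges *r = (struct reversed_ranges *)arg;

    assert(r != NULL);
    if (n == 0 || r->count >= 16)
        return;
    r->first[r->count] = index;
    r->last[r->count] = index + n - 1;  /* (5, 7) covers indexes 5..11 */
    ++r->count;
}

/* Stand-in for one reversal step: reverses n characters starting at
   index, in place, then reports the range through the callback. */
static void reverse_range_sketch(char32_t *string, size_t index, size_t n,
                                 void (*reorder_callback)(size_t, size_t, void *),
                                 void *arg)
{
    size_t i;

    for (i = 0; i < n / 2; ++i) {
        char32_t t = string[index + i];

        string[index + i] = string[index + n - 1 - i];
        string[index + n - 1 - i] = t;
    }
    if (reorder_callback)
        reorder_callback(index, n, arg);
}
```

A real callback would be passed to unicode_bidi_reorder() together with a pointer to the bookkeeping structure as arg.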
The resulting string and embedding levels are in rendering order, but still contain bi-directional embedding, override, boundary-neutral, isolate, and marker characters. unicode_bidi_cleanup() removes these characters and directional markers.

The parameters to unicode_bidi_cleanup() are:

The pointer to the unicode string.
A pointer to the directional embedding level buffer, of the same size as the string. A non-null pointer also removes the corresponding values from the buffer, and the remaining values in the embedding level buffer get reset to levels UNICODE_BIDI_LR and UNICODE_BIDI_RL, only.
The size of the unicode string and the directional embedding buffer (if not NULL).
A bitmask that selects the following options (or 0 if no options):
    UNICODE_BIDI_CLEANUP_EXTRA
        In addition to removing all embedding, override, and boundary-neutral characters as specified by step X9 of the bi-directional algorithm (the default behavior without this flag), also remove all isolation markers and implicit markers.
    UNICODE_BIDI_CLEANUP_BNL
        Replace all characters classified as paragraph separators with a newline character.
    UNICODE_BIDI_CLEANUP_CANONICAL
        A combined set of UNICODE_BIDI_CLEANUP_EXTRA and UNICODE_BIDI_CLEANUP_BNL.
A pointer to a function that gets repeatedly invoked with the index of the character that gets removed from the Unicode string.
An opaque pointer that gets forwarded to the callback.

The function pointer (if not NULL) gets invoked to report the index of each removed character. The reported index is the index from the original string, and the callback gets invoked in strict order, from the first to the last removed character (if any).

The character string and the embedding level values resulting from unicode_bidi_cleanup() with the UNICODE_BIDI_CLEANUP_CANONICAL option are in canonical rendering order. unicode_bidi_logical_order(), unicode_bidi_needs_embed() and unicode_bidi_embed() require the canonical rendering order for their string and embedding level values.
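The "index from the original string, in strict order" behavior can be illustrated with a self-contained sketch. It makes two assumptions: cleanup_sketch() is a hypothetical stand-in that removes only the five explicit embedding and override characters (U+202A through U+202E), whereas step X9 also covers boundary-neutral characters; and the callback takes just the index and the opaque pointer, following the prose above, so the exact prototype should be checked against courier-unicode.h.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Simplification: LRE, RLE, PDF, LRO and RLO only. */
static int is_embedding_or_override(char32_t c)
{
    return c >= 0x202A && c <= 0x202E;
}

/* Hypothetical stand-in: compacts the string in place, reports the
   original index of every removed character, returns the new size. */
static size_t cleanup_sketch(char32_t *string, size_t n,
                             void (*removed_callback)(size_t, void *),
                             void *arg)
{
    size_t in, out = 0;

    assert(string != NULL || n == 0);
    for (in = 0; in < n; ++in) {
        if (is_embedding_or_override(string[in])) {
            if (removed_callback)
                removed_callback(in, arg);  /* index before any removal */
            continue;
        }
        string[out++] = string[in];
    }
    return out;
}

/* Hypothetical log of removed indexes. */
struct removed_log {
    size_t count;
    size_t index[8];
};

static void log_removed(size_t index, void *arg)
{
    struct removed_log *l = (struct removed_log *)arg;

    if (l->count < 8)
        l->index[l->count++] = index;
}
```

Because the indexes refer to the original string, a caller can use them to keep a parallel array (attributes, offsets into a source buffer) in step with the cleaned string.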
The parameters to unicode_bidi_cleaned_size() are a pointer to the unicode string, its size, and the bitmask option to unicode_bidi_cleanup().

Embedding bi-directional markers in Unicode text strings

unicode_bidi_logical_order() rearranges the string from rendering to its logical order. unicode_bidi_embed() adds various bi-directional markers to a Unicode string in canonical rendering order. The resulting string is not guaranteed to be identical to the original Unicode bi-directional string. The algorithm is fairly basic, but the resulting bi-directional string produces the same canonical rendering order after applying unicode_bidi_calc() or unicode_bidi_calc_levels(), unicode_bidi_reorder() and unicode_bidi_cleanup() (with the canonical option), with the same paragraph_embedding level. unicode_bidi_needs_embed() attempts to heuristically determine whether unicode_bidi_embed() is required.

unicode_bidi_logical_order() gets called first, followed by unicode_bidi_embed() (or unicode_bidi_needs_embed() in order to determine whether bi-directional markers are required). Finally, unicode_bidi_embed_paragraph_level() optionally determines whether the resulting string's default paragraph embedding level matches the one used for the actual embedding direction, and if not returns a directional marker to be prepended to the Unicode character string, as a hint.

unicode_bidi_logical_order() factors in the characters' embedding values, and the provided paragraph embedding value (UNICODE_BIDI_LR or UNICODE_BIDI_RL), and rearranges the characters and the embedding levels in left-to-right order, while simultaneously invoking the supplied reorder_callback to indicate each range of characters whose relative order gets reversed. The reorder_callback() receives, as parameters:

The starting index of the first reversed character, in the string.
Number of reversed characters.
Forwarded arg pointer value.
This specifies a consecutive range of characters (and directional embedding values) that get reversed (the first character in the range becomes the last character, and the last character becomes the first character).

After unicode_bidi_logical_order(), unicode_bidi_embed() progressively invokes the passed-in callback with the contents of a bi-directional unicode string. The parameters to unicode_bidi_embed() are:

The Unicode string.
The directional embedding buffer, in canonical rendering order.
The size of the string and the embedding level buffer.
The paragraph embedding level, either UNICODE_BIDI_LR or UNICODE_BIDI_RL.
The pointer to the callback function.
An opaque pointer argument that gets forwarded to the callback function.

The callback receives pointers to various parts of the original string that gets passed to unicode_bidi_embed(), intermixed with bi-directional markers, overrides, and isolates. The callback's parameters are:

The pointer to a Unicode string. The callback is not guaranteed to receive progressively increasing pointers into the original string that gets passed to unicode_bidi_embed(). Some calls will be for individual bi-directional markers, and unicode_bidi_embed() also performs some additional internal reordering, on the fly, after unicode_bidi_logical_order()'s big hammer.
Number of characters in the Unicode string.
Indication whether the Unicode string pointer is pointing to a part of the original Unicode string that's getting embedded. Otherwise this must be some marker character that's not present in the original Unicode string.
Forwarded arg pointer value.

The assembled unicode string should produce the same canonical rendering order, for the same paragraph embedding level.

unicode_bidi_embed_paragraph_level() checks if the specified Unicode string computes the given default paragraph embedding level and returns 0 if it matches.
Otherwise it returns a directional marker that should be prepended to the Unicode string, so that the optional paragraph embedding level pointer passed to unicode_bidi_calc() (or unicode_bidi_calc_levels()) can be NULL and still derive the same default embedding level. The parameters to unicode_bidi_embed_paragraph_level() are:

The Unicode string.
The size of the string.
The paragraph embedding level, either UNICODE_BIDI_LR or UNICODE_BIDI_RL.

unicode_bidi_needs_embed() attempts to heuristically determine whether the Unicode string, in logical order, requires bi-directional markers. The parameters to unicode_bidi_needs_embed() are:

The Unicode string.
The directional embedding buffer, in logical order.
The size of the string and the embedding level buffer.
A pointer to an explicit paragraph embedding level, either UNICODE_BIDI_LR or UNICODE_BIDI_RL; or a NULL pointer (see unicode_bidi_calc_levels()'s explanation for this parameter).

unicode_bidi_needs_embed() returns 0 if the Unicode string does not need explicit directional markers, or 1 if it does. This is done by using unicode_bidi_calc(), unicode_bidi_reorder(), unicode_bidi_logical_order() and then checking if the end result is different from what was passed in.

Combining character ranges

unicode_bidi_combinings() reports consecutive sequences of one or more combining marks in bidirectional text (which can be either in rendering or logical order) that have the same embedding level. It takes the following parameters:

The Unicode string.
The directional embedding buffer, in logical or rendering order. A NULL value for this pointer is equivalent to a directional embedding buffer with a level of 0 for every character in the Unicode string.
Number of characters in the Unicode string.
The pointer to the callback function.
An opaque pointer argument that gets forwarded to the callback function.
The callback function gets invoked for every consecutive sequence of one or more characters that have a canonical combining class other than 0, and with the same embedding level. The parameters to the callback function are:

The embedding level of the combining characters.
The starting index of a consecutive sequence of all characters with the same embedding level.
The number of characters with the same embedding level.
The starting index of a consecutive sequence of all characters with the same embedding level and a canonical combining class other than 0. This will always be equal to or greater than the value of the second parameter.
The number of consecutive characters with the same embedding level and a canonical combining class other than 0. The last character included in this sequence will always be at or before the last character in the sequence defined by the second and the third parameters.
The opaque pointer argument that was passed to unicode_bidi_combinings().

A consecutive sequence of Unicode characters with non-0 combining classes but different embedding levels gets reported individually, for each consecutive sequence with the same embedding level.

This function helps with reordering the combining characters in right-to-left-rendered text. Right-to-left text reversed by unicode_bidi_reorder() results in combining characters preceding their starter character. They get reversed no differently than any other character. The same thing also occurs after unicode_bidi_logical_order() reverses everything back. Use unicode_bidi_combinings() to identify consecutive sequences of combining characters followed by their original starter.

The callback may reorder the characters identified by its fourth and fifth parameters (comb_start and n_comb_chars) in the manner described below.
unicode_bidi_combinings()'s string parameter is a pointer to a constant Unicode string; but the callback can modify the string (via an out-of-band mutable pointer) subject to the following conditions:

The characters identified by the fourth and the fifth parameters (comb_start and n_comb_chars) may be modified.
If the last character in this sequence is not the last character included in the range specified by the second and the third parameters (level_start and n_chars), then one more character after the last character may also be modified. This is, presumably, the original starter that preceded the combining characters before the entire sequence was reversed.

Here's an example of a callback that reverses combining characters and their immediately-following starter character:
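The following is a minimal sketch of such a callback, written to the conditions listed above rather than copied from the library's sources. It assumes that arg carries the mutable pointer to the same string, and it uses a stand-in typedef, bidi_level_sketch_t, in place of unicode_bidi_level_t so that it compiles without courier-unicode.h.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Stand-in for unicode_bidi_level_t (assumption: a small unsigned type). */
typedef unsigned char bidi_level_sketch_t;

static void reorder_rtl_combining(bidi_level_sketch_t level,
                                  size_t level_start, size_t n_chars,
                                  size_t comb_start, size_t n_comb_chars,
                                  void *arg)
{
    char32_t *string = (char32_t *)arg;  /* out-of-band mutable pointer */
    char32_t *b, *e;

    assert(string != NULL);

    if ((level & 1) == 0)
        return;  /* even level: left-to-right, nothing was reversed */

    b = string + comb_start;
    e = b + n_comb_chars;

    /* Include the following starter, but only if it is still inside
       the range of characters with the same embedding level. */
    if (comb_start + n_comb_chars < level_start + n_chars)
        ++e;

    /* Reverse [b, e) in place. */
    while (b < e) {
        char32_t t;

        --e;
        t = *b;
        *b = *e;
        *e = t;
        ++b;
    }
}
```

With the library installed, a function of this shape (declared with unicode_bidi_level_t as its first parameter type) would be passed as the combinings parameter, with the string itself as arg.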
Miscellaneous utility functions

unicode_bidi_get_direction() takes a pointer to a unicode string and the number of characters in the unicode string, and determines the default paragraph level. unicode_bidi_get_direction() returns a struct with the following fields:

direction
    This value is either UNICODE_BIDI_LR or UNICODE_BIDI_RL (left to right or right to left).
is_explicit
    This value is a flag. A non-0 value indicates that the embedding level was derived from an explicit character type (L, R or AL) from the string. A 0 value indicates the default paragraph direction: no explicit character was found in the string.

unicode_bidi_type() looks up each character's bi-directional character type.

unicode_bidi_setbnl() takes a pointer to a unicode string, a pointer to an array of enum_bidi_type_t values and the number of characters in the string and the array. unicode_bidi_setbnl() replaces all paragraph separators in the unicode string with a newline character (same as the UNICODE_BIDI_CLEANUP_BNL option to unicode_bidi_cleanup()).

unicode_bidi_mirror() returns the glyph that's a mirror image of the parameter (i.e. an open parenthesis for a close parenthesis, and vice versa); or the same value if there is no mirror image (this is the Bidi_Mirrored=Yes property).

unicode_bidi_bracket_type() looks up each bracket character and returns its opposite, or the same value if the character is not a bracket that has an opposing bracket character (this is the Bidi_Paired_Bracket_type property). A non-NULL ret gets initialized to either UNICODE_BIDI_o, UNICODE_BIDI_c or UNICODE_BIDI_n.
SEE ALSO
TR-9, unicode::bidi(3), courier-unicode(7).
unicode_canonical(3)

NAME
unicode_canonical, unicode_ccc, unicode_decomposition_init, unicode_decomposition_deinit, unicode_decompose, unicode_decompose_reallocate_size, unicode_compose, unicode_composition_init, unicode_composition_deinit, unicode_composition_apply - unicode canonical normalization and denormalization

SYNOPSIS
#include <courier-unicode.h>

unicode_canonical_t unicode_canonical(char32_t c);

uint8_t unicode_ccc(char32_t c);

void unicode_decomposition_init(unicode_decomposition_t *info, char32_t *string, size_t *string_size, void *arg);

int unicode_decompose(unicode_decomposition_t *info);

void unicode_decomposition_deinit(unicode_decomposition_t *info);

size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info, const size_t *sizes, size_t n);

int unicode_compose(char32_t *string, size_t string_size, int flags, size_t *new_size);

int unicode_composition_init(const char32_t *string, size_t string_size, int flags, unicode_composition_t *compositions);

void unicode_composition_deinit(unicode_composition_t *compositions);

size_t unicode_composition_apply(char32_t *string, size_t string_size, unicode_composition_t *compositions);

DESCRIPTION
These functions compose or decompose a Unicode string into a canonical or a compatible normalized form.

unicode_canonical() looks up the character's canonical and compatibility mapping. unicode_canonical() returns a structure with the following fields:

canonical_chars
    A pointer to the canonical or equivalent representation of the character.
n_canonical_chars
    Number of characters in the canonical_chars.
format
    A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical mapping, other values indicate a compatibility equivalent mapping.

A NULL canonical_chars (with a 0 n_canonical_chars) indicates that the character has no canonical or compatibility equivalence.

unicode_ccc() returns the character's canonical combining class value.
unicode_decomposition_init(), unicode_decompose() and unicode_decomposition_deinit() implement a complete interface for decomposing a Unicode string:
unicode_decomposition_init() initializes a new unicode_decomposition_t structure that gets passed in as its first parameter. The second parameter is a pointer to a Unicode string, with the number of characters in the string in the third parameter. A string size of -1 indicates a \0-terminated string and calculates its string_size (which does not include the trailing \0). The last parameter is a void *, an opaque pointer that gets stored in the initialized unicode_decomposition_t object:
typedef struct unicode_decomposition {
    char32_t *string;
    size_t string_size;
    int decompose_flags;
    int (*reallocate)(
        struct unicode_decomposition *info,
        const size_t *offsets,
        const size_t *sizes,
        size_t n
    );
    void *arg;
} unicode_decomposition_t;
unicode_decompose() proceeds and decomposes the string and replaces it with its decomposed string version.

unicode_decomposition_t's string, string_size and arg are copies of unicode_decomposition_init()'s parameters. unicode_decomposition_init() initializes all other fields to their default values.

decompose_flags gets initialized to 0, and is a bitmask:

UNICODE_DECOMPOSE_FLAG_QC
    Check each character's appropriate quick check property and skip decomposing Unicode characters that would get re-composed by unicode_composition_apply().
UNICODE_DECOMPOSE_FLAG_COMPAT
    Perform a compatibility decomposition instead of a canonical decomposition.

reallocate is a pointer to a function that gets called to reallocate a larger string. unicode_decompose() determines which characters in the string need decomposing and calls the reallocate function pointer zero or more times. Each call to reallocate passes information about where new characters will get inserted into the string.

reallocate only needs to grow the size of the buffer where string points so that it's big enough to hold a larger, decomposed string; then update string accordingly. reallocate should not update string_size or make any changes to the existing string; that's unicode_decompose()'s job (after reallocate returns).

The reallocate callback function receives the following parameters:

A pointer to the unicode_decomposition_t and, notably, its arg.
A pointer to the array of offset indexes in the string where new characters will get inserted in order to hold the decomposed string.
A pointer to the array that holds the number of characters that get inserted at each corresponding offset.
The size of the two arrays.

reallocate must update the string if necessary to hold at least the number of characters that's the sum total of the initial string_size and the sum total of all sizes.
unicode_decomposition_init() initializes the reallocate pointer to a default implementation that uses realloc(3) and updates string with its return value. The application can use its own reallocate to handle this task on its own, and use unicode_decompose_reallocate_size() to compute the minimum string size:
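/* Current size plus every pending insertion (variable names are illustrative). */
size_t new_size=info->string_size;

for (i=0; i<n; ++i)
    new_size += sizes[i];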
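A replacement reallocate could follow the same arithmetic. The sketch below is an assumption-laden illustration: struct decomposition_sketch is a cut-down stand-in for unicode_decomposition_t with only the fields used here, grown_size() plays the role described for unicode_decompose_reallocate_size(), and the extra \0 slot imitates what the text says the default implementation does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <uchar.h>

/* Stand-in for unicode_decomposition_t; only the fields used here. */
struct decomposition_sketch {
    char32_t *string;
    size_t string_size;
    void *arg;
};

/* Minimum size: the current size plus every pending insertion. */
static size_t grown_size(const struct decomposition_sketch *info,
                         const size_t *sizes, size_t n)
{
    size_t total = info->string_size;
    size_t i;

    for (i = 0; i < n; ++i)
        total += sizes[i];
    return total;
}

/* Grows the buffer and updates string; returns 0 on success and a
   non-0 value on failure. string_size is deliberately left alone. */
static int reallocate_sketch(struct decomposition_sketch *info,
                             const size_t *offsets,
                             const size_t *sizes, size_t n)
{
    size_t new_size;
    char32_t *p;

    (void)offsets;  /* only the total matters when growing the buffer */
    assert(info != NULL);

    new_size = grown_size(info, sizes, n);
    p = realloc(info->string, (new_size + 1) * sizeof(char32_t));
    if (!p)
        return -1;

    p[new_size] = 0;  /* extra slot so the result can stay \0-terminated */
    info->string = p;
    return 0;
}
```

An application with its own allocator would replace the realloc() call and use arg to reach its allocator state.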
The reallocate function returns 0 on success and a non-0 error code to report a failure; and unicode_decompose() does the same. The only error condition from unicode_decompose() is a non-0 error code from the reallocate function. Otherwise: a successful decomposition results in unicode_decompose() returning 0 and unicode_decomposition_init()'s string pointing to the decomposed string and string_size giving the number of characters in the decomposed string. string_size does not include the trailing \0 character. The input string also has its string_size specified without counting its \0 character. The default implementation of reallocate allocates an extra char32_t and sets it to a \0. Therefore:

If the Unicode string before decomposition has a trailing \0, no decomposition occurs, and no calls to reallocate take place: the string in the unicode_decomposition_t is unchanged and it's still \0-terminated.
The default reallocate allocates an extra char32_t and sets it to a \0; and it takes care of that for the decomposed string.
An application that provides its own replacement reallocate is responsible for doing the same, if it wants the decomposed string to be \0-terminated.

Multiple calls to the reallocate callback are possible. Each call to reallocate reflects the prior calls' decompositions. Example: the original string has five characters and the first call to reallocate had two offsets, at positions 1 and 3, with a value of 1 for both of their sizes. This has the effect of transforming an original Unicode string "AAAAA" into "AXAAXAA" (with A representing unspecified characters in the original string, and X showing the two characters added in the first call to reallocate). A second call to reallocate with an offset at position 4, and a size of 1, results in the updated string of "AXAAYXAA" (with Y marking an unspecified character inserted by the second call).

Unicode string decomposition involves replacing a given Unicode character with one or more other characters.
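The "AAAAA" example above can be reproduced with a small stand-alone helper. This is only an illustration of how the offsets and sizes of successive calls relate to the growing string: insert_at_offsets() is a hypothetical function, it works on plain char for readability rather than char32_t, and it assumes the offsets are in ascending order and that the buffer is already large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Inserts sizes[i] copies of fill at offsets[i]. Offsets refer to the
   string as it was when the call started, like one reallocate call.
   Returns the new length; the result is \0-terminated. */
static size_t insert_at_offsets(char *buf, size_t len,
                                const size_t *offsets, const size_t *sizes,
                                size_t n, char fill)
{
    size_t total = len;
    size_t end = len;  /* end of the part of the old string not yet moved */
    size_t dst;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < n; ++i)
        total += sizes[i];

    /* Work from the last offset back, so earlier offsets stay valid. */
    dst = total;
    for (i = n; i-- > 0; ) {
        size_t tail = end - offsets[i];

        dst -= tail;
        memmove(buf + dst, buf + offsets[i], tail);
        dst -= sizes[i];
        memset(buf + dst, fill, sizes[i]);
        end = offsets[i];
    }
    buf[total] = '\0';
    return total;
}
```

Starting from "abcde", a first call with offsets 1 and 3 (size 1 each) and a second call with offset 4 (size 1) produce strings with the same layout as "AXAAXAA" and "AXAAYXAA".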
The sizes given to reallocate reflect the net addition to the Unicode string. For example: decomposing one Unicode character into three decomposed characters results in a call to reallocate reporting an insert of two more characters.

offsets actually report the indices of each Unicode character that's getting decomposed. A 1:1 decomposition of a Unicode character gets reported as an additional sizes entry of 0.

unicode_decomposition_deinit() releases all resources and destroys the unicode_decomposition_t; it is no longer valid. unicode_decomposition_deinit() does not free(3) the string. The original string gets passed in to unicode_decomposition_init() and the decomposed string is left in the string. The default implementation of the reallocate function assumes the string is a malloc(3)-ed string, and reallocs it.

At this time unicode_decomposition_deinit() does nothing. All code should explicitly call it in order to remain forward-compatible (at the source level).

unicode_compose() performs a canonical composition of a decomposed string. Its parameters are:

A pointer to the decomposed Unicode string.
The number of characters in the Unicode string. The Unicode string does not need to be \0-terminated; if it is, this number does not include it.
A flags bitmask, which can have the following values:
    UNICODE_COMPOSE_FLAG_REMOVEUNUSED
        Remove all combining marks after doing all canonical compositions. Normally any unused combining marks are left in place, in the combined text. This option removes them.
    UNICODE_COMPOSE_FLAG_ONESHOT
        Perform canonical composition once per character, and do not attempt to combine any resulting combined characters again.
A non-NULL pointer to a size_t. A successful composition sets this size_t to the number of characters in the combined string, and returns 0. The string gets combined in place: the combined string gets placed back into the string parameter, and this size_t gives its size.
unicode_compose() returns a non-zero value to indicate an error. unicode_composition_init(), unicode_composition_apply() and unicode_composition_deinit() implement a detailed interface for canonical composition of a decomposed Unicode string:
The first two parameters to both unicode_composition_init() and unicode_composition_apply() are the same: the Unicode string and the number of characters (not including any trailing \0 character) in the Unicode string. unicode_composition_init()'s additional parameters are: any optional flags (see unicode_compose() for a list of available flags), and the address of a unicode_composition_t object. A non-0 return from unicode_composition_init() indicates an error. unicode_composition_init() indicates success by returning 0 and initializing the unicode_composition_t object, which contains a pointer to an array of pointers to unicode_compose_info objects, and the number of pointers. unicode_composition_init() does not change the string; the only thing it does is initialize the unicode_composition_t object. unicode_composition_apply() applies the compositions to the string, in place, and returns the new size of the string (also not including the \0 character; however, it does append one if the composed string is smaller, so the composed string is \0-terminated if the decomposed string was). It is necessary to call unicode_composition_deinit() to free all memory that was allocated for the unicode_composition_t object:
struct unicode_compose_info {
    size_t index;
    size_t n_composed;
    char32_t *composition;
    size_t n_composition;
};

typedef struct {
    struct unicode_compose_info **compositions;
    size_t n_compositions;
} unicode_composition_t;
index gives the character index in the string where each composition occurs. n_composed gives the number of characters in the original string that get composed. The composed characters are the composition, and n_composition gives the number of composed characters. Effectively: at the index position in the original string, n_composed characters get removed and n_composition characters replace them (always n_composed or fewer). The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of including the combining marks that did not get combined in the n_composed count. It's possible that, in this case, n_composition is 0. This indicates complete removal of the combining marks, without anything getting combined in their place.
unicode_composition_init() sets unicode_composition_t's compositions pointer to an array of pointers to unicode_compose_info objects that are sorted according to their index. n_compositions gives the number of pointers in the array, and is 0 if there are no compositions and the array is empty. The empty array gets interpreted accordingly when it gets passed to unicode_composition_apply() and unicode_composition_deinit(): nothing happens. unicode_composition_apply() simply returns the size of the unchanged string, and unicode_composition_deinit() does a pro-forma cleanup.
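The effect of the index, n_composed, composition and n_composition fields can be sketched with a small stand-alone model. This is not the library's unicode_composition_apply(); struct compose_info_model and apply_model() are hypothetical names that only mirror the documented fields, and the sketch assumes the records are sorted by index and that n_composition never exceeds n_composed, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in that mirrors the documented fields. */
struct compose_info_model {
    size_t index;                 /* where the composition occurs */
    size_t n_composed;            /* characters removed at index */
    const char32_t *composition;  /* replacement characters */
    size_t n_composition;         /* number of replacement characters */
};

/* Applies the records to str in place and returns the new length.
 * A \0 is appended only when the string became shorter. */
size_t apply_model(char32_t *str, size_t len,
                   const struct compose_info_model *const *comps, size_t n)
{
    size_t in = 0, out = 0, i, j;

    for (i = 0; i < n; i++) {
        const struct compose_info_model *c = comps[i];

        while (in < c->index)            /* keep untouched characters */
            str[out++] = str[in++];
        for (j = 0; j < c->n_composition; j++)
            str[out++] = c->composition[j];
        in += c->n_composed;             /* skip the composed run */
    }
    while (in < len)                     /* keep the tail */
        str[out++] = str[in++];
    if (out < len)
        str[out] = 0;
    return out;
}
```

Because the write position never overtakes the read position, the in-place update needs no temporary buffer. With n_composition set to 0, the same loop simply drops the n_composed characters, which matches the UNICODE_COMPOSE_FLAG_REMOVEUNUSED case.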
SEE ALSO
TR-15, courier-unicode(7), unicode::canonical(3).
unicode_category_lookup(3), Courier Unicode Library. Author: Sam Varshavchik.
unicode_category_lookup, unicode_isalnum, unicode_isalpha, unicode_isblank, unicode_isdigit, unicode_isgraph, unicode_islower, unicode_ispunct, unicode_isspace, unicode_isupper - unicode character categorization
#include <courier-unicode.h>
uint32_t unicode_category_lookup(char32_t c);
int unicode_isalnum(char32_t c);
int unicode_isalpha(char32_t c);
int unicode_isblank(char32_t c);
int unicode_isdigit(char32_t c);
int unicode_isgraph(char32_t c);
int unicode_islower(char32_t c);
int unicode_ispunct(char32_t c);
int unicode_isspace(char32_t c);
int unicode_isupper(char32_t c);
DESCRIPTION
unicode_category_lookup() looks up the unicode character's categorization. unicode_category_lookup() returns a 32 bit value. The value's UNICODE_CATEGORY_1 bits specify the first level of the unicode character's category, with UNICODE_CATEGORY_2, UNICODE_CATEGORY_3, and UNICODE_CATEGORY_4 bits specifying the 2nd, 3rd, and 4th level, if given. A value of 0 for each corresponding bit set indicates that no category is specified for this level, for this character; otherwise the possible values are defined in <courier-unicode.h>.
The remaining functions implement comparable equivalents of their non-unicode versions in the standard C library, as follows:
unicode_isalnum() Returns non-0 for all characters for which unicode_isalpha() or unicode_isdigit() returns non-0.
unicode_isalpha() Returns non-0 for all UNICODE_CATEGORY_1_LETTER.
unicode_isblank() Returns non-0 for TAB, and all UNICODE_CATEGORY_2_SPACE.
unicode_isdigit() Returns non-0 for all UNICODE_CATEGORY_1_NUMBER | UNICODE_CATEGORY_2_DIGIT, only (no third categories).
unicode_isgraph() Returns non-0 for all codepoints above SPACE which are not unicode_isspace().
unicode_islower() Returns non-0 for all unicode_isalpha() for which the character is equal to unicode_lc(3) of itself.
unicode_ispunct() Returns non-0 for all UNICODE_CATEGORY_1_PUNCTUATION.
unicode_isspace() Returns non-0 for unicode_isblank() or for unicode characters with linebreaking properties of BK, CR, LF, NL, and SP.
unicode_isupper() Returns non-0 for all unicode_isalpha() for which the character is equal to unicode_uc(3) of itself.
SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).
unicode_convert(3), Courier Unicode Library
unicode_u_ucs4_native, unicode_u_ucs2_native, unicode_convert_init, unicode_convert, unicode_convert_deinit, unicode_convert_tocbuf_init, unicode_convert_tou_init, unicode_convert_fromu_init, unicode_convert_uc, unicode_convert_tocbuf_toutf8_init, unicode_convert_tocbuf_fromutf8_init, unicode_convert_toutf8, unicode_convert_fromutf8, unicode_convert_tobuf, unicode_convert_tou_tobuf, unicode_convert_fromu_tobuf - unicode character set conversion
#include <courier-unicode.h>
extern const char unicode_u_ucs4_native[];
extern const char unicode_u_ucs2_native[];
unicode_convert_handle_t unicode_convert_init(const char *src_chset, const char *dst_chset, void *cb_arg);
int unicode_convert(unicode_convert_handle_t handle, const char *text, size_t cnt);
int unicode_convert_deinit(unicode_convert_handle_t handle, int *errptr);
unicode_convert_handle_t unicode_convert_tocbuf_init(const char *src_chset, const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_tocbuf_toutf8_init(const char *src_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_tocbuf_fromutf8_init(const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_tou_init(const char *src_chset, char32_t **ucptr_ret, size_t *ucsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_fromu_init(const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
int unicode_convert_uc(unicode_convert_handle_t handle, const char32_t *text, size_t cnt);
char *unicode_convert_toutf8(const char *text, const char *charset, int *error);
char *unicode_convert_fromutf8(const char *text, const char *charset, int *error);
char *unicode_convert_tobuf(const char *text, const char *charset, const char *dstcharset, int *error);
int unicode_convert_tou_tobuf(const char *text, size_t text_l, const char *charset, char32_t **uc, size_t *ucsize, int *error);
int unicode_convert_fromu_tobuf(const char32_t *utext, size_t utext_l, const char *charset, char **c, size_t *csize, int *error);
DESCRIPTION
unicode_u_ucs4_native[] contains the string UCS-4BE or UCS-4LE, matching the native char32_t endianness. unicode_u_ucs2_native[] contains the string UCS-2BE or UCS-2LE, matching the native char32_t endianness.
unicode_convert_init(), unicode_convert(), and unicode_convert_deinit() are an adaptation of the iconv(3) API that uses the same calling convention as the other algorithms in this unicode library, with some value-added features. These functions use iconv(3) to effect the actual character set conversion.
unicode_convert_init() returns a non-NULL handle for the requested conversion, or NULL if the requested conversion is not available. unicode_convert_init() takes a pointer to the output function that receives converted character text. The output function receives a pointer to the converted character text, and the number of characters in the converted text. The output function gets repeatedly called, until it receives the entire converted text.
The character text to convert gets passed, repeatedly, to unicode_convert(). Each call to unicode_convert() results in the output function getting invoked, zero or more times, with each successive part of the converted text. Finally, unicode_convert_deinit() stops the conversion and deallocates the conversion handle. It's possible that a call to unicode_convert_deinit() results in some additional calls to the output function, passing the remaining, final parts of the converted text, before unicode_convert_deinit() deallocates the handle and returns.
The output function should return 0 normally. A non-0 return indicates an error condition. unicode_convert_deinit() returns non-zero if any previous invocation of the output function returned non-zero (this includes any invocations of the output function resulting from this call, or prior unicode_convert() calls), or 0 if all invocations of the output function returned 0. If the errptr is not NULL, *errptr gets set to non-zero if there were any conversion errors, that is, if there was any text that could not be converted to the destination character set. unicode_convert() also returns non-zero if it calls the output function and it returns non-zero; however, the conversion handle remains allocated, so unicode_convert_deinit() must still be called to clean that up.
Collecting converted text into a buffer
Call unicode_convert_tocbuf_init() instead of unicode_convert_init(), then call unicode_convert() and unicode_convert_deinit() normally. As with unicode_convert_init(), the first two parameters specify the source and the destination character sets. unicode_convert_tocbuf_toutf8_init() is just an alias that specifies UTF-8 as the destination character set. unicode_convert_tocbuf_fromutf8_init() is just an alias that specifies UTF-8 as the source character set.
These functions supply an output function that collects the converted text into a malloc()ed buffer. If unicode_convert_deinit() returns 0, *cbufptr_ret gets initialized to a malloc()ed buffer, and the number of converted characters, the size of the malloc()ed buffer, gets placed into *cbufsize_ret. If the converted string is an empty string, *cbufsize_ret gets set to 0, but *cbufptr_ret still gets initialized (to a dummy malloc()ed buffer). A non-zero nullterminate places a trailing \0 character after the converted string (this is included in *cbufsize_ret).
Converting between character sets and unicode
unicode_convert_tou_init() converts character text into a char32_t buffer.
It works just like unicode_convert_tocbuf_init(), except that only the source character set gets specified and the output buffer is a char32_t buffer. nullterminate terminates the converted unicode characters with a U+0000.
unicode_convert_fromu_init() converts char32_ts to the output character set, and also works like unicode_convert_tocbuf_init(). Additionally, in this case, unicode_convert_uc() works just like unicode_convert() except that the input sequence is a char32_t sequence, and the count parameter is the number of unicode characters.
One-shot conversions
unicode_convert_toutf8() converts the specified text in the specified character set into a UTF-8 string, returning a malloc()ed buffer. If error is not NULL, then even if unicode_convert_toutf8() returns a non-NULL value, *error gets set to a non-zero value if a character conversion error has occurred and some characters could not be converted.
unicode_convert_fromutf8() does a similar conversion from UTF-8 text to the specified character set.
unicode_convert_tobuf() does a similar conversion between two different character sets.
unicode_convert_tou_tobuf() calls unicode_convert_tou_init(), feeds the character string through unicode_convert(), then calls unicode_convert_deinit(). If this function returns 0, *uc and *ucsize are set to a malloc()ed buffer and its size, holding the unicode char array.
unicode_convert_fromu_tobuf() calls unicode_convert_fromu_init(), feeds the unicode array through unicode_convert_uc(), then calls unicode_convert_deinit(). If this function returns 0, *c and *csize are set to a malloc()ed buffer and its size, holding the char array.
SEE ALSO
courier-unicode(7), unicode_convert_tocase(3), unicode_default_chset(3).
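The output function contract described in unicode_convert(3) (receive a pointer and a count, possibly many times, return 0 unless something went wrong) can be illustrated with a collector that appends each piece to a growing buffer. This is only a sketch: the exact output function prototype is not shown in the synopsis above, so the (text, count, opaque argument) shape used here is an assumption to be checked against <courier-unicode.h>, and struct collect_buf and collect_output() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator passed as the opaque callback argument. */
struct collect_buf {
    char *data;    /* malloc-ed, \0-terminated once non-NULL */
    size_t size;   /* number of characters collected so far */
};

/* Assumed output function shape: appends cnt characters from text.
 * Returns 0 on success, non-0 to report an error to the caller. */
int collect_output(const char *text, size_t cnt, void *arg)
{
    struct collect_buf *b = arg;
    char *p = realloc(b->data, b->size + cnt + 1);

    if (!p)
        return -1;              /* propagated as a non-0 result */
    memcpy(p + b->size, text, cnt);
    b->size += cnt;
    p[b->size] = '\0';
    b->data = p;
    return 0;
}
```

The unicode_convert_tocbuf_init() family supplies a collector of this kind already, so a hand-written one is mainly useful when the converted text should be streamed somewhere else.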
unicode_default_chset(3), Courier Unicode Library
unicode_default_chset, unicode_locale_chset - return the system character set name
#include <courier-unicode.h>
const char *unicode_default_chset();
const char *unicode_locale_chset();
DESCRIPTION
unicode_default_chset() returns the name of the system environment character set (usually nl_langinfo(CODESET), or from some suitable environment variable). unicode_locale_chset() returns the name of the current application locale's character set.
SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).
unicode_emoji_lookup(3), Courier Unicode Library
unicode_emoji_lookup, unicode_emoji, unicode_emoji_presentation, unicode_emoji_modifier, unicode_emoji_modifier_base, unicode_emoji_component, unicode_emoji_extended_pictographic - look up unicode character's Unicode Emoji Classification
#include <courier-unicode.h>
unicode_emoji_t unicode_emoji_lookup(char32_t c);
int unicode_emoji(char32_t c);
int unicode_emoji_presentation(char32_t c);
int unicode_emoji_modifier(char32_t c);
int unicode_emoji_modifier_base(char32_t c);
int unicode_emoji_component(char32_t c);
int unicode_emoji_extended_pictographic(char32_t c);
DESCRIPTION
unicode_emoji_lookup() returns the unicode emoji properties of the specified character, as a bitmask of UNICODE_EMOJI flags, as defined in the header file.
unicode_emoji(), unicode_emoji_presentation(), unicode_emoji_modifier(), unicode_emoji_modifier_base(), unicode_emoji_component(), and unicode_emoji_extended_pictographic() check whether the given character carries a specific emoji property. They return 0 if not, and non-0 if the specified character has the corresponding property.
SEE ALSO
TR-51, courier-unicode(7).
unicode_html40ent_lookup(3), Courier Unicode Library
unicode_html40ent_lookup - look up unicode character for an HTML 4.0 entity
#include <courier-unicode.h>
char32_t unicode_html40ent_lookup(const char *entity);
DESCRIPTION
unicode_html40ent_lookup() returns the unicode character represented by an HTML 4.0 entity. The entity is a string, such as quot, in which case unicode_html40ent_lookup() returns 34. Additionally, unicode_html40ent_lookup() parses a numerical entity given as #decimal or #xhex. unicode_html40ent_lookup() returns 0 if the entity is not a known entity that represents a single unicode character.
SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).
unicode_grapheme_break(3), Courier Unicode Library
unicode_grapheme_break, unicode_grapheme_break_init, unicode_grapheme_break_next, unicode_grapheme_break_deinit - unicode grapheme cluster boundary rules
#include <courier-unicode.h>
unicode_grapheme_break_info_t unicode_grapheme_break_init();
int unicode_grapheme_break_next(unicode_grapheme_break_info_t handle, char32_t c);
void unicode_grapheme_break_deinit(unicode_grapheme_break_info_t handle);
int unicode_grapheme_break(char32_t a, char32_t b);
DESCRIPTION
These functions implement the unicode grapheme cluster breaking algorithm. Invoke unicode_grapheme_break_init() to initialize the grapheme cluster breaking algorithm. unicode_grapheme_break_init() returns an opaque handle. Each subsequent call to unicode_grapheme_break_next() passes this handle, and the next character. unicode_grapheme_break_next() returns a non-0 value if there's a grapheme break before the character, in a sequence of Unicode characters. unicode_grapheme_break_deinit() releases all resources used by the grapheme breaking handle, and the unicode_grapheme_break_info_t handle is no longer valid after this call.
The first call to unicode_grapheme_break_next() always returns non-0, as per the GB1 rule.
unicode_grapheme_break() is a simplified interface that returns non-zero if there is a grapheme break between two unicode characters a and b. This is equivalent to calling unicode_grapheme_break_init(), followed by two calls to unicode_grapheme_break_next(), and finally unicode_grapheme_break_deinit(), then returning the result of the second call to unicode_grapheme_break_next().
SEE ALSO
TR-29, courier-unicode(7), unicode_convert_tocase(3), unicode_line_break(3), unicode_word_break(3).
unicode_line_break(3), Courier Unicode Library
unicode_line_break, unicode_lb_init, unicode_lb_set_opts, unicode_lb_next, unicode_lb_next_cnt, unicode_lb_end, unicode_lbc_init, unicode_lbc_set_opts, unicode_lbc_next, unicode_lbc_next_cnt, unicode_lbc_end - calculate mandatory or allowed line breaks
#include <courier-unicode.h>
unicode_lb_info_t unicode_lb_init(int (*cb_func)(int, void *), void *cb_arg);
void unicode_lb_set_opts(unicode_lb_info_t lb, int opts);
int unicode_lb_next(unicode_lb_info_t lb, char32_t c);
int unicode_lb_next_cnt(unicode_lb_info_t lb, const char32_t *cptr, size_t cnt);
int unicode_lb_end(unicode_lb_info_t lb);
unicode_lbc_info_t unicode_lbc_init(int (*cb_func)(int, char32_t, void *), void *cb_arg);
void unicode_lbc_set_opts(unicode_lbc_info_t lb, int opts);
int unicode_lbc_next(unicode_lbc_info_t lb, char32_t c);
int unicode_lbc_next_cnt(unicode_lbc_info_t lb, const char32_t *cptr, size_t cnt);
int unicode_lbc_end(unicode_lbc_info_t lb);
DESCRIPTION
These functions implement the unicode line breaking algorithm. Invoke unicode_lb_init() to initialize the line breaking algorithm. The first parameter is a callback function. The second parameter is an opaque pointer. The callback function gets invoked with two parameters. The first parameter is one of three values: UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or UNICODE_LB_ALLOWED, as described below.
The second parameter is the opaque pointer that was passed to unicode_lb_init(); the opaque pointer is not subject to any further interpretation by these functions.
unicode_lb_init() returns an opaque handle. Repeated invocations of unicode_lb_next(), passing the handle and one unicode character at a time, define a sequence of unicode characters over which the line breaking algorithm calculation takes place. unicode_lb_next_cnt() is a shortcut for invoking unicode_lb_next() repeatedly over an array cptr containing cnt unicode characters.
unicode_lb_end() denotes the end of the unicode character sequence. After the call to unicode_lb_end() the line breaking unicode_lb_info_t handle is no longer valid.
Between the call to unicode_lb_init() and unicode_lb_end(), the callback function gets invoked exactly once for each unicode character given to unicode_lb_next() or unicode_lb_next_cnt(). Usually each call to unicode_lb_next() results in the callback function getting invoked immediately, but it does not have to be. It's possible that a call to unicode_lb_next() returns without invoking the callback function, and some subsequent call to unicode_lb_next() (or unicode_lb_end()) invokes the callback function more than once, to catch up. The contract is that before unicode_lb_end() returns, the callback function gets invoked the exact number of times as the number of characters in the unicode sequence defined by the intervening calls to unicode_lb_next() and unicode_lb_next_cnt(), unless an error occurs.
Each call to the callback function reports the calculated line breaking status of the corresponding character in the unicode character sequence:
UNICODE_LB_MANDATORY A line break is MANDATORY before the corresponding character.
UNICODE_LB_NONE A line break is PROHIBITED before the corresponding character.
UNICODE_LB_ALLOWED A line break is OPTIONAL before the corresponding character.
The callback function should return 0.
A non-zero value indicates to the line breaking algorithm that an error has occurred. unicode_lb_next() and unicode_lb_next_cnt() return zero either if they never invoked the callback function, or if each call to the callback function returned zero. A non-zero return from the callback function results in unicode_lb_next() and unicode_lb_next_cnt() immediately returning the same value.
unicode_lb_end() must be invoked to destroy the line breaking handle even if unicode_lb_next() and unicode_lb_next_cnt() returned an error indication. It's also possible that, under normal circumstances, unicode_lb_end() invokes the callback function one or more times. The return value from unicode_lb_end() has the same meaning as from unicode_lb_next() and unicode_lb_next_cnt(); however, in all cases, after unicode_lb_end() returns the line breaking handle is no longer valid.
Alternative callback function
unicode_lbc_init(), unicode_lbc_next(), unicode_lbc_next_cnt(), and unicode_lbc_end() are alternative functions that implement the same algorithm. The only difference is that the callback function receives an extra parameter, the unicode character value to which the line breaking status applies, passed through from the input unicode character sequence.
Options
unicode_lb_set_opts() and unicode_lbc_set_opts() enable non-default options for the line breaking algorithm. These functions must be called immediately after unicode_lb_init() or unicode_lbc_init(), and before any other function. opts is a bitmask that can contain the following values:
UNICODE_LB_OPT_PRBREAK Enables a modified LB24 rule. This prevents plus signs, as in C++, from breaking. This flag adds the following rules to the LB24 rule:
PR x PR
AL x PR
ID x PR
UNICODE_LB_OPT_SYBREAK Tailored breaking rules for the / character. This prevents breaking after the / character (think URLs); including an exception to the x SY rule in LB13. This flag adds the following rules to the LB24 rule:
SY x EX
SY x AL
SY x ID
SP ÷ SY, which takes precedence over "x SY".
UNICODE_LB_OPT_DASHWJ This flag reclassifies U+2013 and U+2014 as class WJ, prohibiting breaks before and after the en dash (U+2013) and the em dash (U+2014) unicode characters.
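Because the callback runs exactly once per input character, a callback can recover character positions simply by counting its own invocations. The sketch below shows one way to do that with the int (*cb_func)(int, void *) shape from the synopsis; it assumes the invocations arrive in input order. struct break_log and log_break() are hypothetical names, and the numeric value of UNICODE_LB_NONE is not assumed: the caller is expected to store it in none_value before passing the structure as cb_arg.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BREAKS 64   /* arbitrary capacity for this sketch */

/* Hypothetical state passed as the opaque cb_arg pointer. */
struct break_log {
    int none_value;                 /* caller stores UNICODE_LB_NONE here */
    size_t next_index;              /* index of the character being reported */
    size_t positions[MAX_BREAKS];   /* indexes with a break before them */
    size_t n_positions;
};

/* Callback: records every character index whose status is not "none".
 * Returns non-0 when the log is full, which the library is documented
 * to pass back to the caller as an error. */
int log_break(int status, void *arg)
{
    struct break_log *log = arg;

    if (status != log->none_value) {
        if (log->n_positions == MAX_BREAKS)
            return -1;
        log->positions[log->n_positions++] = log->next_index;
    }
    log->next_index++;              /* one invocation per character */
    return 0;
}
```

A caller would hand log_break and a pointer to the structure to unicode_lb_init(), feed the text through unicode_lb_next_cnt(), and read the positions only after unicode_lb_end() returns, since some invocations may be deferred until then.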
SEE ALSO
courier-unicode(7), unicode::linebreak(3), TR-14.
unicode_script(3), Courier Unicode Library
unicode_script - unicode script property
#include <courier-unicode.h>
unicode_script_t unicode_script(char32_t ch);
DESCRIPTION
unicode_script() looks up the script property of the specified unicode character, and returns it. The unicode_script_t enumeration encodes possible unicode script values. unicode_script_unknown gets returned for a unicode character with an unknown script property.
SEE ALSO
TR-24, courier-unicode(7).
unicode_word_break(3), Courier Unicode Library
unicode_wb_init, unicode_wb_next, unicode_wb_next_cnt, unicode_wb_end, unicode_wbscan_init, unicode_wbscan_next, unicode_wbscan_end, unicode_word_break - calculate word breaks
#include <courier-unicode.h>
unicode_wb_info_t unicode_wb_init(int (*cb_func)(int, void *), void *cb_arg);
int unicode_wb_next(unicode_wb_info_t wb, char32_t c);
int unicode_wb_next_cnt(unicode_wb_info_t wb, const char32_t *cptr, size_t cnt);
int unicode_wb_end(unicode_wb_info_t wb);
unicode_wbscan_info_t unicode_wbscan_init();
int unicode_wbscan_next(unicode_wbscan_info_t wbs, char32_t c);
size_t unicode_wbscan_end(unicode_wbscan_info_t wbs);
DESCRIPTION
These functions implement the unicode word breaking algorithm. Invoke unicode_wb_init() to initialize the word breaking algorithm. The first parameter is a callback function. The second parameter is an opaque pointer. The callback function gets invoked with two parameters. The first parameter reports the word breaking status of a character. The second parameter is the opaque pointer that was given to unicode_wb_init(); the opaque pointer is not subject to any further interpretation by these functions.
unicode_wb_init() returns an opaque handle. Repeated invocations of unicode_wb_next(), passing the handle and one unicode character at a time, define a sequence of unicode characters over which the word breaking algorithm calculation takes place. unicode_wb_next_cnt() is a shortcut for invoking unicode_wb_next() repeatedly over an array cptr containing cnt unicode characters.
unicode_wb_end() denotes the end of the unicode character sequence. After the call to unicode_wb_end() the word breaking unicode_wb_info_t handle is no longer valid. Between the call to unicode_wb_init() and unicode_wb_end(), the callback function gets invoked exactly once for each unicode character given to unicode_wb_next() or unicode_wb_next_cnt(). Usually each call to unicode_wb_next() results in the callback function getting invoked immediately, but it does not have to be. It's possible that a call to unicode_wb_next() returns without invoking the callback function, and some subsequent call to unicode_wb_next() (or unicode_wb_end()) invokes the callback function more than once, to catch things up. The contract is that before unicode_wb_end() returns, the callback function gets invoked the exact number of times as the number of characters in the unicode sequence defined by the intervening calls to unicode_wb_next() and unicode_wb_next_cnt(), unless an error occurs. Each call to the callback function reports the calculated wordbreaking status of the corresponding character in the unicode character sequence. If the parameter to the callback function is non zero, a word break is permitted before the corresponding character. A zero value indicates that a word break is prohibited before the corresponding character. The callback function should return 0. A non-zero value indicates to the word breaking algorithm that an error has occurred. unicode_wb_next() and unicode_wb_next_cnt() return zero either if they never invoked the callback function, or if each call to the callback function returned zero. A non zero return from the callback function results in unicode_wb_next() and unicode_wb_next_cnt() immediately returning the same value. unicode_wb_end() must be invoked to destroy the word breaking handle even if unicode_wb_next() and unicode_wb_next_cnt() returned an error indication. 
It's also possible that, under normal circumstances, unicode_wb_end() invokes the callback function one or more times. The return value from unicode_wb_end() has the same meaning as from unicode_wb_next() and unicode_wb_next_cnt(); however, in all cases, after unicode_wb_end() returns the word breaking handle is no longer valid.
Word scan
unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end() scan for the next word boundary in a unicode character sequence. unicode_wbscan_init() obtains a handle, then unicode_wbscan_next() gets repeatedly invoked to define the unicode character sequence. unicode_wbscan_end() deallocates the handle and returns the number of leading characters in the unicode character sequence up to the first word break.
A non-0 return value from unicode_wbscan_next() indicates that the word boundary is already known, and any further calls to unicode_wbscan_next() will be ignored. unicode_wbscan_end() must still be called, to obtain the unicode character count.
SEE ALSO
TR-29, courier-unicode(7), unicode::wordbreak(3), unicode_convert_tocase(3), unicode_line_break(3), unicode_grapheme_break(3).
unicode_uc(3), Courier Unicode Library
unicode_uc, unicode_lc, unicode_tc, unicode_convert_tocase - unicode uppercase, lowercase, and titlecase character lookup
#include <courier-unicode.h>
char32_t unicode_uc(char32_t c);
char32_t unicode_lc(char32_t c);
char32_t unicode_tc(char32_t c);
char *unicode_convert_tocase(const char *str, const char *charset, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));
DESCRIPTION
unicode_uc(), unicode_lc(), and unicode_tc() return the uppercase, lowercase, or the titlecase equivalent of the unicode character c. If this character does not have an uppercase, lowercase, or a titlecase equivalent, these functions return c, the same character.
unicode_convert_tocase() takes the string str in the character set charset.
first_char_func and char_func should each be unicode_uc, unicode_lc, or unicode_tc. unicode_convert_tocase() returns a malloc()ed buffer. The first unicode character in str gets processed by first_char_func, and all other characters by char_func.
SEE ALSO
courier-unicode(7), unicode_convert(3), unicode_default_chset(3), unicode_html40ent_lookup(3), unicode_category_lookup(3), unicode_grapheme_break(3), unicode_word_break(3), unicode_line_break(3).
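The division of labor between first_char_func and char_func in unicode_uc(3) can be shown with a stand-alone model. This is not unicode_convert_tocase() itself: map_case() is a hypothetical helper that works directly on a char32_t array, skipping the character set conversion, and ascii_uc() and ascii_lc() are ASCII-only stand-ins for unicode_uc and unicode_lc.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* ASCII-only stand-ins for unicode_uc() and unicode_lc(). */
char32_t ascii_uc(char32_t c)
{
    return (c >= U'a' && c <= U'z') ? c - 32 : c;
}

char32_t ascii_lc(char32_t c)
{
    return (c >= U'A' && c <= U'Z') ? c + 32 : c;
}

/* Hypothetical helper: the first character goes through
 * first_char_func, every other character through char_func. */
void map_case(char32_t *s, size_t n,
              char32_t (*first_char_func)(char32_t),
              char32_t (*char_func)(char32_t))
{
    size_t i;

    for (i = 0; i < n; i++)
        s[i] = (i == 0 ? first_char_func : char_func)(s[i]);
}
```

Passing an uppercasing function first and a lowercasing function second capitalizes only the leading character; with the real library, unicode_tc would normally be the more appropriate choice for the first character.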
C++ manual pages
unicode::bidi(3), Courier Unicode Library
unicode::bidi, unicode::bidi_calc, unicode::bidi_calc_types, unicode::bidi_reorder, unicode::bidi_cleanup, unicode::bidi_logical_order, unicode::bidi_combinings, unicode::bidi_needs_embed, unicode::bidi_embed, unicode::bidi_embed_paragraph_level, unicode::bidi_get_direction, unicode::bidi_override - unicode bi-directional algorithm
#include <courier-unicode.h>
struct unicode::bidi_calc_types {
    bidi_calc_types(const std::u32string &string);
    std::vector<unicode_bidi_type_t> types;
    setbnl(std::u32string &string);
};
std::tuple<std::vector<unicode_bidi_level_t>, struct unicode_bidi_direction> unicode::bidi_calc(const unicode::bidi_calc_types &ustring);
std::tuple<std::vector<unicode_bidi_level_t>, struct unicode_bidi_direction> unicode::bidi_calc(const unicode::bidi_calc_types &ustring, unicode_bidi_level_t embedding_level);
int unicode::bidi_reorder(std::u32string &string, std::vector<unicode_bidi_level_t> &embedding_level, const std::function<void (size_t, size_t)> &reorder_callback=[](size_t, size_t){}, size_t starting_pos=0, size_t n=(size_t)-1);
void unicode::bidi_reorder(std::vector<unicode_bidi_level_t> &embedding_level, const std::function<void (size_t, size_t)> &reorder_callback=[](size_t, size_t){}, size_t starting_pos=0, size_t n=(size_t)-1);
void unicode::bidi_cleanup(std::u32string &string, const std::function<void (size_t)> &removed_callback=[](size_t){}, int cleanup_options);
int unicode::bidi_cleanup(std::u32string &string, std::vector<unicode_bidi_level_t> &levels, const std::function<void (size_t)> &removed_callback=[](size_t){}, int cleanup_options=0);
int unicode::bidi_cleanup(std::u32string &string, std::vector<unicode_bidi_level_t> &levels, const std::function<void (size_t)> &removed_callback, int cleanup_options, size_t starting_pos, size_t n);
int unicode::bidi_logical_order(std::u32string &string, std::vector<unicode_bidi_level_t> &levels, unicode_bidi_level_t paragraph_embedding, const std::function<void (size_t, size_t)> &reorder_callback=[](size_t, size_t){}, size_t starting_pos=0, size_t n=(size_t)-1);
void unicode::bidi_combinings(const std::u32string &string, const std::vector<unicode_bidi_level_t> &levels, const std::function<void (unicode_bidi_level_t level, size_t level_start, size_t n_chars, size_t comb_start, size_t n_comb_chars)> &callback);
void unicode::bidi_combinings(const std::u32string &string, const std::function<void (unicode_bidi_level_t level, size_t level_start, size_t n_chars, size_t comb_start, size_t n_comb_chars)> &callback);
void unicode::bidi_logical_order(std::vector<unicode_bidi_level_t> &levels, unicode_bidi_level_t paragraph_embedding, const std::function<void (size_t, size_t)> &reorder_callback, size_t starting_pos=0, size_t n=(size_t)-1);
bool unicode::bidi_needs_embed(const std::u32string &string, const std::vector<unicode_bidi_level_t> &levels, const unicode_bidi_level_t *paragraph_embedding=NULL, size_t starting_pos=0, size_t n=(size_t)-1);
int unicode::bidi_embed(const std::u32string &string, const std::vector<unicode_bidi_level_t> &levels, unicode_bidi_level_t paragraph_embedding, const std::function<void (const char32_t *, size_t, bool)> &callback);
std::u32string unicode::bidi_embed(const std::u32string &string, const std::vector<unicode_bidi_level_t> &levels, unicode_bidi_level_t paragraph_embedding);
char32_t unicode::bidi_embed_paragraph_level(const std::u32string &string, unicode_bidi_level_t paragraph_embedding);
unicode_bidi_direction unicode::bidi_get_direction(const std::u32string &string, size_t starting_pos=0, size_t n=(size_t)-1);
std::u32string unicode::bidi_override(const std::u32string &string, unicode_bidi_level_t direction, int cleanup_options=0);
DESCRIPTION
These functions implement the C++ interface for the Unicode Bi-Directional algorithm. See the description of the underlying unicode_bidi(3) C library API for more information. C++ specific notes:
unicode::bidi_calc returns the directional embedding value buffer and the calculated paragraph embedding level.
Its ustring parameter is implicitly converted from a std::u32string, so a std::u32string named text can be passed directly, as in unicode::bidi_calc(text).
Alternatively, a unicode::bidi_calc_types object, such as unicode::bidi_calc_types types{text}, gets constructed from the same std::u32string and then passed directly to unicode::bidi_calc, as in unicode::bidi_calc(types).
This provides the means to access the intermediate unicode_bidi_type_t values (the types member) that get calculated from the Unicode text string. In all cases the std::u32string cannot be a temporary object, and it must remain in scope until unicode::bidi_calc() returns. The optional setbnl() method uses unicode_bidi_setbnl 3 to replace paragraph separators with newline characters in the unicode string. It requires the same unicode string that was passed to the constructor as a parameter (because the constructor takes a constant reference, but this method modifies the string).
Several C functions provide a dry-run mode by passing a NULL pointer. The C++ API provides separate overloads, with and without the nullable parameter.
Several C functions accept a nullable function pointer, with the NULL function pointer specifying no callback. The C++ functions have a std::function parameter with a default do-nothing closure.
Several C functions accept two parameters, a Unicode character pointer and the embedding level buffer, and a single parameter that specifies the size of both. The equivalent C++ function takes two discrete parameters, a std::u32string and a std::vector, and returns an int: a negative value if their sizes differ, or 0 if their sizes match and the requested function completes. The unicode::bidi_embed overload that returns a std::u32string returns an empty string in case of a mismatch.
unicode::bidi_reorder reorders the entire string and its embedding_levels by default. The optional starting_pos and n parameters limit the reordering to the indicated subset of the original string (specified as the starting position offset index, and the number of characters).
unicode::bidi_reorder, unicode::bidi_cleanup, unicode::bidi_logical_order, unicode::bidi_needs_embed and unicode::bidi_get_direction take two optional parameters (defaulted values or overloaded) specifying an optional starting position and number of characters that define a subset of the original string that gets reordered, cleaned up, or has its direction determined.
The unicode::bidi_cleanup overload that takes these two parameters does not trim the passed-in string and embedding level buffer, since it affects only a subset of the string. The number of times the removed character callback gets invoked indicates by how much the substring should be trimmed.
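The trimming step for the subset form of unicode::bidi_cleanup might look like the following caller-side sketch. It is not library code: trim_cleaned_range and level_t are hypothetical names, level_t stands in for unicode_bidi_level_t, and the sketch assumes that the surviving characters end up packed at the front of the cleaned range, leaving the stale tail at its end. The removed argument is the number of times the removed character callback was invoked.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's unicode_bidi_level_t.
using level_t = unsigned char;

// Erase the stale tail of a cleaned range [starting_pos, starting_pos + n).
// Assumption: the cleanup packed the surviving characters at the front of
// the range, so the last `removed` positions of the range are leftovers.
// Returns false, changing nothing, if the arguments are inconsistent.
bool trim_cleaned_range(std::u32string &text,
                        std::vector<level_t> &levels,
                        std::size_t starting_pos,
                        std::size_t n,
                        std::size_t removed)
{
    if (text.size() != levels.size() ||
        starting_pos > text.size() ||
        n > text.size() - starting_pos ||
        removed > n)
        return false;

    // First stale position inside the cleaned range.
    const std::size_t tail = starting_pos + n - removed;

    text.erase(tail, removed);

    const auto first = levels.begin() + static_cast<std::ptrdiff_t>(tail);
    levels.erase(first, first + static_cast<std::ptrdiff_t>(removed));
    return true;
}
```

A caller would count the callback invocations with a closure such as [&](size_t){ ++removed; } and then pass that count to the helper, keeping the string and the embedding level buffer the same size.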
unicode::bidi_override returns a copy of the passed-in string, modified as follows. First, unicode::bidi_cleanup() is applied with the specified, or defaulted, cleanup_options. Then either the LRO or the RLO override marker gets prepended to the Unicode string, forcing the entire string to be interpreted in a single rendering direction when processed by the Unicode bi-directional algorithm. unicode::bidi_override makes it possible to use a Unicode-aware application or algorithm in a context that only works with text that's always displayed in a fixed direction, allowing graceful handling of input containing bi-directional text.
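The effect of the override marker can be illustrated with a minimal standalone sketch. It covers only the prepending step, not the cleanup that unicode::bidi_override applies first, and prepend_override is a hypothetical helper rather than part of the library. The two code points are the standard U+202D (LEFT-TO-RIGHT OVERRIDE) and U+202E (RIGHT-TO-LEFT OVERRIDE).

```cpp
#include <string>

// Standard Unicode override markers.
constexpr char32_t LRO = 0x202D; // LEFT-TO-RIGHT OVERRIDE
constexpr char32_t RLO = 0x202E; // RIGHT-TO-LEFT OVERRIDE

// Hypothetical helper: return a copy of `text` with a single override
// marker in front, so that a bi-directional renderer treats the whole
// string as running in one direction.
std::u32string prepend_override(const std::u32string &text, bool right_to_left)
{
    std::u32string result;
    result.reserve(text.size() + 1);
    result.push_back(right_to_left ? RLO : LRO);
    result += text;
    return result;
}
```

The result is always exactly one character longer than the input, which is worth keeping in mind when the string is paired with a separately stored embedding level buffer.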
unicode::literals namespace
This namespace contains the following constexpr definitions: char32_t arrays with literal Unicode character strings containing Unicode directional, isolate, and override markers, like LRO, RLO and others; and the CLEANUP_EXTRA, CLEANUP_BNL, and CLEANUP_CANONICAL options for unicode::bidi_cleanup().
SEE ALSO courier-unicode 7, unicode_bidi 3.
unicode::canonical 3
unicode::canonical, unicode::decompose, unicode::decompose_default_reallocate, unicode::compose, unicode::compose_default_callback - unicode canonical normalization and denormalization
#include <courier-unicode.h>
constexpr int decompose_flag_qc=UNICODE_DECOMPOSE_FLAG_QC;
constexpr int decompose_flag_compat=UNICODE_DECOMPOSE_FLAG_COMPAT;
constexpr int compose_flag_removeunused=UNICODE_COMPOSE_FLAG_REMOVEUNUSED;
constexpr int compose_flag_oneshot=UNICODE_COMPOSE_FLAG_ONESHOT;
void decompose_default_reallocate(std::u32string &string, const std::vector<std::tuple<size_t, size_t>> &list);
void decompose(std::u32string &string, int flags=0, const std::function<void (std::u32string &, const std::vector<std::tuple<size_t, size_t>>)> &reallocate=decompose_default_reallocate);
void compose_default_callback(unicode_composition_t &compositions);
void compose(std::u32string &string, int flags=0, const std::function<void (unicode_composition_t &)> &cb=compose_default_callback);
DESCRIPTION These functions implement the C++ interface for the Unicode Canonical Decomposition and Composition. See the description of the underlying unicode_canonical 3 C library API for more information. C++ specific notes: The C++ decomposition reallocate callback receives a single vector of offset and size tuples instead of two separate arrays or vectors. unicode::decompose_default_reallocate() is the C++ version of the default reallocate callback. It receives the same tuple vector parameter, too. The C++ interface uses std::u32strings to represent Unicode text strings, and unicode::decompose_default_reallocate() resizes the string. Like the C callback, the C++ one gets called 0 or more times. unicode::compose() takes care of initializing, applying, and de-initializing the unicode_composition_t object used for the composition. The callback receives a reference to the unicode_composition_t object, which the callback should not modify in any way.
SEE ALSO courier-unicode 7, unicode_canonical 3.
unicode::iconvert::convert 3
unicode::iconvert::convert, unicode::ucs_4, unicode::ucs_2, unicode::utf_8, unicode::iso_8859_1 - unicode character set conversion
#include <courier-unicode.h>
extern const char unicode::ucs_4[];
extern const char unicode::ucs_2[];
extern const char unicode::utf_8[];
extern const char unicode::iso_8859_1[];
std::string unicode::iconvert::convert(const std::string &text, const std::string &srccharset, const std::string &dstcharset);
std::string unicode::iconvert::convert(const std::string &text, const std::string &srccharset, const std::string &dstcharset, bool &errflag);
std::string unicode::iconvert::convert(const std::vector<char32_t> &text, const std::string &dstcharset);
std::string unicode::iconvert::convert(const std::vector<char32_t> &text, const std::string &dstcharset, bool &errflag);
bool unicode::iconvert::convert(const std::string &text, const std::string &charset, std::vector<char32_t> &text);
DESCRIPTION The overloaded unicode::iconvert::convert() functions do one of the following: convert a text string between two different character sets, returning the new string; convert a vector of unicode characters (not null-terminated) to a character string in a supported character set; or initialize a vector of unicode characters, passed by reference, by converting a text string in a given character set to unicode. These functions use iconv 3, and can use any character set that's supported by iconv 3. Use unicode::ucs_2 and unicode::ucs_4 to specify the 16-bit and the 32-bit unicode encodings in native byte order. Use unicode::utf_8 and unicode::iso_8859_1 to specify these two standard character sets. The overloaded versions that take a reference to a bool set the flag to true if some characters could not be converted. The overloaded version that initializes a unicode vector returns the bool flag instead.
SEE ALSO courier-unicode 7, unicode::iconvert::convert_tocase 3, unicode_convert 3, iconv 3.
unicode::iconvert::convert_tocase 3
unicode::iconvert::convert_tocase - unicode uppercase, lowercase, and titlecase conversion
#include <courier-unicode.h>
std::string unicode::iconvert::convert_tocase(const std::string &text, const std::string &charset, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));
std::string unicode::iconvert::convert_tocase(const std::string &text, const std::string &charset, bool &err, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));
DESCRIPTION The overloaded unicode::iconvert::convert_tocase() function converts the characters of the text parameter, which is in the charset character set, to lowercase, uppercase, or titlecase. text gets converted, internally, into unicode. first_char_func and char_func are each one of unicode_lc, unicode_uc, or unicode_tc. If the converted text string is not empty, first_char_func converts the first unicode character in the text string, and char_func converts any remaining characters. unicode_lc converts its character to lowercase, unicode_uc to uppercase, and unicode_tc to titlecase. Finally, the unicode string gets converted back to charset, and the result gets returned. The optional err parameter gets set to true if an error was encountered converting the text string to or from unicode.
SEE ALSO courier-unicode 7, unicode::iconvert::convert 3, unicode_convert 3, iconv 3.
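The division of labor between first_char_func and char_func in unicode::iconvert::convert_tocase can be sketched without the library. The sketch below covers only the middle step (casing an already decoded unicode string), and every name in it is hypothetical: apply_case is not a library function, and ascii_uc and ascii_lc are ASCII-only stand-ins for unicode_uc and unicode_lc.

```cpp
#include <string>

// ASCII-only stand-in for the library's unicode_uc().
char32_t ascii_uc(char32_t c)
{
    return (c >= U'a' && c <= U'z') ? static_cast<char32_t>(c - U'a' + U'A') : c;
}

// ASCII-only stand-in for the library's unicode_lc().
char32_t ascii_lc(char32_t c)
{
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c - U'A' + U'a') : c;
}

// Hypothetical helper: apply first_char_func to the first character and
// char_func to every remaining character. An empty string is returned
// unchanged, because neither function gets called.
std::u32string apply_case(std::u32string text,
                          char32_t (*first_char_func)(char32_t),
                          char32_t (*char_func)(char32_t))
{
    bool first = true;
    for (char32_t &c : text)
    {
        c = first ? first_char_func(c) : char_func(c);
        first = false;
    }
    return text;
}
```

Passing an uppercase function first and a lowercase function second capitalizes only the leading character; passing the same function twice cases the whole string uniformly.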
unicode::iconvert::fromu 3
unicode::iconvert::fromu - template for converting text sequence from unicode
#include <courier-unicode.h>
output_iter_t unicode::iconvert::fromu::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, output_iter_t output_iter, bool &errflag);
void unicode::iconvert::fromu::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, std::string &out_buf, bool &errflag);
std::pair<std::string, bool> unicode::iconvert::fromu::convert(const std::u32string &text, const std::string &charset);
DESCRIPTION These template functions convert unicode characters to text in the given character set. beg_iter and end_iter define an input sequence of char32_ts. They get converted to text in the charset character set. output_iter is an output iterator over chars, to which convert() writes the converted text. convert() returns the value of the output iterator after iterating over the converted character sequence. errflag gets set to true if the unicode text could not be converted to the requested character set, or false for a successful conversion. An overloaded convert() puts the text string into a std::string, instead of using an output iterator. Finally, a single std::u32string specifies the character string, instead of a beginning and an ending iterator, and convert() returns a std::pair with the converted text and the error flag.
SEE ALSO courier-unicode 7, unicode::iconvert::convert 3, unicode_convert 3, iconv 3.
unicode::iconvert::tou 3
unicode::iconvert::tou - template for converting text sequence to unicode
#include <courier-unicode.h>
output_iter_t unicode::iconvert::tou::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, bool &errflag, output_iter_t output_iter);
bool unicode::iconvert::tou::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, std::u32string &out_buf);
std::pair<std::u32string, bool> unicode::iconvert::tou::convert(const std::string &text, const std::string &charset);
DESCRIPTION These template functions convert text in a given character set to unicode characters. beg_iter and end_iter define an input sequence of chars in the charset character set. They get converted to unicode characters. output_iter is an output iterator over char32_ts, to which convert() writes the converted characters. convert() returns the value of the output iterator after iterating over the converted character sequence. errflag, passed by reference, gets set to true if some character could not be converted to unicode from the specified character set, and false if the conversion completed without errors. An overloaded convert() puts the unicode character sequence into a std::u32string, instead of using an output iterator, and returns the error flag. Finally, a single std::string specifies the character string, instead of a beginning and an ending iterator, and convert() returns a std::pair with the converted unicode text and the error flag.
SEE ALSO courier-unicode 7, unicode::iconvert::convert 3, unicode_convert 3, iconv 3.
unicode::linebreak 3
unicode::linebreak_callback_base, unicode::linebreak_callback_save_buf, unicode::linebreakc_callback_base, unicode::linebreak_iter, unicode::linebreakc_iter - unicode line-breaking rules
#include <courier-unicode.h>
class linebreak : public unicode::linebreak_callback_base {
public:
    using unicode::linebreak_callback_base::operator<<;
    using unicode::linebreak_callback_base::operator();
    int callback(int linebreak_code)
    {
        // ...
    }
};
char32_t c;
std::u32string buf;
linebreak compute_linebreak;
compute_linebreak.set_opts(UNICODE_LB_OPT_SYBREAK);
compute_linebreak << c;
compute_linebreak(buf);
compute_linebreak(buf.begin(), buf.end());
compute_linebreak.finish();
// ...
unicode::linebreak_callback_save_buf linebreaks;
std::list<int> lb=linebreaks.lb_buf;
class linebreakc : public unicode::linebreakc_callback_base {
public:
    using unicode::linebreakc_callback_base::operator<<;
    using unicode::linebreakc_callback_base::operator();
    int callback(int linebreak_code, char32_t ch)
    {
        // ...
    }
};
// ...
std::u32string buf;
typedef unicode::linebreak_iter<std::u32string::const_iterator> iter_t;
iter_t beg_iter(buf.begin(), buf.end()), end_iter;
beg_iter.set_opts(UNICODE_LB_OPT_SYBREAK);
std::vector<int> linebreaks;
std::copy(beg_iter, end_iter, std::back_insert_iterator<std::vector<int>>(linebreaks));
// ...
typedef unicode::linebreakc_iter<std::u32string::const_iterator> iter_t;
iter_t beg_iter(buf.begin(), buf.end()), end_iter;
beg_iter.set_opts(UNICODE_LB_OPT_SYBREAK);
std::vector<std::pair<int, char32_t>> linebreaks;
std::copy(beg_iter, end_iter, std::back_insert_iterator<std::vector<std::pair<int, char32_t>>>(linebreaks));
DESCRIPTION unicode::linebreak_callback_base is a C++ binding for the unicode line-breaking rule implementation described in unicode_line_break 3. Subclass unicode::linebreak_callback_base and implement callback() that's virtually inherited from unicode::linebreak_callback_base.
The callback() function receives the output values from the line-breaking algorithm, the UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or the UNICODE_LB_ALLOWED value, for each unicode character. callback() should return 0. A non-zero return reports an error, which stops the line-breaking algorithm. See unicode_line_break 3 for more information. The alternate unicode::linebreakc_callback_base interface uses a virtually inherited callback() that receives two parameters, the line-break code value, and the corresponding unicode character.
The input unicode characters for the line-breaking algorithm are provided by the << operator, one unicode character at a time; or by the () operator, passing either a container, or a beginning and an ending iterator value for an input sequence of unicode characters. finish() indicates the end of the unicode character sequence. set_opts() sets line-breaking options (see unicode_lb_set_opts() for more information). unicode::linebreak_callback_save_buf is a subclass that implements callback() by saving the line-break codes into a std::list.
The linebreak_iter template implements an input iterator over ints. The template parameter is an input iterator over unicode chars. The constructor's parameters are a beginning and an ending iterator value for a sequence of char32_t. This constructs the beginning iterator value for a sequence of ints consisting of line-break values (UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or UNICODE_LB_ALLOWED) corresponding to each char32_t in the underlying sequence. The default constructor creates the ending iterator value for the sequence. The iterator implements a set_opts() method that sets the options for the line-breaking algorithm. The linebreakc_iter template implements a similar input iterator, with the difference that it ends up iterating over a std::pair of line-breaking values and the corresponding char32_t from the underlying input sequence.
SEE ALSO courier-unicode 7, unicode_line_break 3.
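One way to consume the per-character line-break codes, however they were collected, is a greedy wrap. The sketch below is standalone and makes several assumptions: lb_none, lb_allowed and lb_mandatory are hypothetical stand-ins for UNICODE_LB_NONE, UNICODE_LB_ALLOWED and UNICODE_LB_MANDATORY, each code is taken to describe the position before its character (as unicode_line_break 3 is understood to define it), and width counts characters rather than display columns.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the library's UNICODE_LB_* values.
enum lb_code { lb_none, lb_allowed, lb_mandatory };

// Greedy wrap: always break before a character marked mandatory, and break
// at the most recent allowed position once a line grows past `width`.
// A line without any allowed break is left over-long rather than split.
std::vector<std::u32string> wrap_lines(const std::u32string &text,
                                       const std::vector<int> &codes,
                                       std::size_t width)
{
    std::vector<std::u32string> lines;
    std::size_t line_start = 0;   // first character of the current line
    std::size_t last_allowed = 0; // == line_start means "none seen yet"

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const int code = i < codes.size() ? codes[i] : lb_none;

        if (i > line_start && code == lb_mandatory)
        {
            lines.push_back(text.substr(line_start, i - line_start));
            line_start = last_allowed = i;
            continue;
        }
        if (i > line_start && code == lb_allowed)
            last_allowed = i;

        if (i - line_start + 1 > width && last_allowed > line_start)
        {
            lines.push_back(text.substr(line_start, last_allowed - line_start));
            line_start = last_allowed;
        }
    }
    if (line_start < text.size())
        lines.push_back(text.substr(line_start));
    return lines;
}
```

The codes vector has the same shape as what a std::copy from a linebreak_iter range into a std::vector<int> would produce, so the helper could be fed from either the callback or the iterator interface.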
unicode::tolower 3
unicode::tolower, unicode::toupper - unicode version of tolower 3 and toupper 3
#include <courier-unicode.h>
std::string unicode::tolower(const std::string &string);
std::string unicode::tolower(const std::string &string, const std::string &charset);
std::u32string unicode::tolower(const std::u32string &u);
std::string unicode::toupper(const std::string &string);
std::string unicode::toupper(const std::string &string, const std::string &charset);
std::u32string unicode::toupper(const std::u32string &u);
DESCRIPTION These functions convert the string parameter, in charset or unicode_default_chset 3, to unicode, replace each character with unicode_lc 3 or unicode_uc 3, then convert it back to the same character set, returning the resulting string. Passing a const std::u32string & directly also converts it accordingly, returning the converted unicode string.
SEE ALSO courier-unicode 7.
unicode::wordbreak 3
unicode::wordbreak_callback_base, unicode::wordbreak - unicode word-breaking rules
#include <courier-unicode.h>
class wordbreak : public unicode::wordbreak_callback_base {
public:
    using unicode::wordbreak_callback_base::operator<<;
    using unicode::wordbreak_callback_base::operator();
    int callback(bool flag)
    {
        // ...
    }
};
char32_t c;
std::u32string buf;
wordbreak compute_wordbreak;
compute_wordbreak << c;
compute_wordbreak(buf);
compute_wordbreak(buf.begin(), buf.end());
compute_wordbreak.finish();
// ...
unicode::wordbreakscan scan;
scan << c;
size_t nchars=scan.finish();
DESCRIPTION unicode::wordbreak_callback_base is a C++ binding for the unicode word-breaking rule implementation described in unicode_word_break 3. Subclass unicode::wordbreak_callback_base and implement callback() that's virtually inherited from unicode::wordbreak_callback_base.
The callback() function receives the output values from the word-breaking algorithm, namely a bool indicating whether a word break exists before the unicode character in the underlying input sequence. callback() should return 0. A non-zero return reports an error, which stops the word-breaking algorithm. See unicode_word_break 3 for more information. The input unicode characters for the word-breaking algorithm are provided by the << operator, one unicode character at a time; or by the () operator, passing either a container, or a beginning and an ending iterator value for an input sequence of unicode characters. finish() indicates the end of the unicode character sequence.
unicode::wordbreakscan is a C++ binding for the unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end() functions described in unicode_word_break 3. Its << operator iterates over the unicode characters, and finish() indicates the number of characters before the first unicode word break. The << operator returns a bool indicating when the first word break has already been found, so further calls are not necessary.
SEE ALSO courier-unicode 7, unicode_word_break 3.
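Since each flag passed to callback() says whether a word break exists before the corresponding character, a saved sequence of flags is enough to cut the text into pieces. The following standalone sketch shows one way to do that; split_at_breaks is a hypothetical helper, not part of the library, and break_before is assumed to hold one flag per character in input order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every position whose flag reports a
// word break before that character. A break reported before the very first
// character does not produce an empty leading piece.
std::vector<std::u32string> split_at_breaks(const std::u32string &text,
                                            const std::vector<bool> &break_before)
{
    std::vector<std::u32string> pieces;
    std::u32string current;

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const bool is_break = i < break_before.size() && break_before[i];

        if (is_break && !current.empty())
        {
            pieces.push_back(current);
            current.clear();
        }
        current.push_back(text[i]);
    }
    if (!current.empty())
        pieces.push_back(current);
    return pieces;
}
```

Whitespace between words comes out as its own piece under this scheme, because a break is typically reported both before and after it; a caller that only wants words can discard pieces consisting solely of spaces.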
COPYING The Courier Unicode Library is free software, distributed under the terms of the GPL, version 3: