TextEncoder¶
-
class
TextEncoder
¶ This class can be used to convert text between multiple representations, e.g. UTF-8 to UTF-16. You may use it as a static class object, passing the encoding each time, or you may create an instance and use that object, which will record the current encoding and retain the current string.
This class is also a base class of
TextNode
, which inherits this functionality.
-
enum
Encoding
¶ -
enumerator
E_iso8859
= 0¶
-
enumerator
E_utf8
= 1¶
-
enumerator
E_utf16be
= 2¶
-
enumerator
E_unicode
= 2¶ Deprecated alias for E_utf16be
-
TextEncoder
(void)¶
-
TextEncoder
(TextEncoder const &copy)¶
-
void
append_text
(PyObject *text)¶ Appends the indicated string to the end of the stored text.
-
void
append_unicode_char
(char32_t character)¶ Appends a single character to the end of the stored text. This may be a wide character, up to 16 bits in Unicode.
-
void
append_wtext
(std::wstring const &text)¶ Appends the indicated string to the end of the stored wide-character text.
-
void
clear_text
(void)¶ Removes the text from the
TextEncoder
.
-
PyObject *
decode_text
(PyObject *text) const¶
-
static PyObject *
decode_text
(PyObject *text, Encoding encoding)¶ Decodes the given encoded string into a wide-character string. The one-argument form uses the encoder's current encoding; the static two-argument form uses the given encoding.
-
static PyObject *
encode_wchar
(char32_t ch, Encoding encoding)¶ Encodes a single Unicode character into a one-, two-, three-, or four-byte string, according to the given encoding system.
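To make the one-to-four-byte range concrete, here is a minimal sketch that uses Python's built-in codecs rather than TextEncoder itself. The helper name `encoded_length` is hypothetical, and the sketch assumes that E_utf8 behaves like standard UTF-8.

```python
def encoded_length(code_point: int, codec: str = "utf-8") -> int:
    """Return how many bytes one Unicode code point occupies in `codec`."""
    return len(chr(code_point).encode(codec))


# Sample code points: ASCII 'A', Latin 'e' with acute accent,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
lengths = [encoded_length(cp) for cp in samples]
```

Under standard UTF-8 these four code points occupy one, two, three, and four bytes respectively, which is the range the description refers to.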
-
PyObject *
encode_wtext
(std::wstring const &wtext) const¶
-
static PyObject *
encode_wtext
(std::wstring const &wtext, Encoding encoding)¶ Encodes a wide-text string into a single-char string. The one-argument form uses the encoder's current encoding; the static two-argument form uses the given encoding.
-
Encoding
get_default_encoding
(void)¶ Returns the default encoding that is used for all subsequently created
TextEncoder
objects. See set_encoding()
.
-
std::string
get_encoded_char
(std::size_t index) const¶
-
std::string
get_encoded_char
(std::size_t index, Encoding encoding) const¶ Returns the nth char of the stored text, as a one-, two-, or three-byte encoded string.
-
Encoding
get_encoding
(void) const¶ Returns the encoding by which the string set via
set_text()
is to be interpreted. See set_encoding()
.
-
std::size_t
get_num_chars
(void) const¶ Returns the number of characters in the stored text. This is a count of wide characters, after the string has been decoded according to
set_encoding()
.
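The difference between a byte count and a decoded character count can be sketched with Python's standard codecs. This does not call TextEncoder; `count_chars` is a hypothetical helper that only illustrates the idea of counting after decoding.

```python
def count_chars(encoded: bytes, codec: str) -> int:
    """Count characters after decoding, not raw bytes."""
    return len(encoded.decode(codec))


# "hello" with an accented e: five characters in every encoding,
# but a different number of bytes depending on the encoding.
text = "h\u00e9llo"
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

byte_counts = (len(utf8_bytes), len(utf16_bytes))
char_counts = (count_chars(utf8_bytes, "utf-8"),
               count_chars(utf16_bytes, "utf-16-be"))
```

The character count stays the same across encodings while the byte count varies, which is why the encoding must be set correctly before the count is meaningful.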
-
PyObject *
get_text
(void) const¶
-
PyObject *
get_text
(TextEncoder::Encoding encoding) const¶ Returns the current text. The no-argument form encodes it via the current encoding system; the one-argument form encodes it via the indicated encoding system.
-
std::string
get_text_as_ascii
(void) const¶ Returns the text associated with the node, converted as nearly as possible to a fully-ASCII representation. This means replacing accented letters with their unaccented ASCII equivalents.
It is possible that some characters in the string cannot be converted to ASCII. (The string may involve symbols like the copyright symbol, for instance, or it might involve letters in some other alphabet such as Greek or Cyrillic, or even Latin letters like thorn or eth that are not part of the ASCII character set.) In this case, as much of the string as possible will be converted to ASCII, and the nonconvertible characters will remain encoded in the encoding specified by
set_encoding()
.
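The folding behavior described above can be approximated with Python's `unicodedata` module. This is only a sketch of the idea, not the method TextEncoder uses internally; `fold_to_ascii` is a hypothetical name, and the actual conversion tables in the library may produce different results for some characters.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Replace accented letters with ASCII bases; keep everything else as-is."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII
            continue
        # Canonical decomposition splits a letter from its combining accents.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and all(ord(c) < 128 for c in base):
            out.append(base)  # accent stripped, ASCII base remains
        else:
            out.append(ch)    # not convertible: leave the original character
    return "".join(out)


folded = fold_to_ascii("caf\u00e9 \u00a9 \u03a9 \u00fe")
```

In this sketch the accented e becomes a plain e, while the copyright sign, the Greek omega, and thorn pass through unchanged because they have no ASCII base letter.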
-
int
get_unicode_char
(std::size_t index) const¶ Returns the Unicode value of the nth character in the stored text. This may be a wide character (greater than 255), after the string has been decoded according to
set_encoding()
.
-
std::wstring const &
get_wtext
(void) const¶ Returns the text associated with the
TextEncoder
, as a wide-character string.
-
std::wstring
get_wtext_as_ascii
(void) const¶ Returns the text associated with the node, converted as nearly as possible to a fully-ASCII representation. This means replacing accented letters with their unaccented ASCII equivalents.
It is possible that some characters in the string cannot be converted to ASCII. (The string may involve symbols like the copyright symbol, for instance, or it might involve letters in some other alphabet such as Greek or Cyrillic, or even Latin letters like thorn or eth that are not part of the ASCII character set.) In this case, as much of the string as possible will be converted to ASCII, and the nonconvertible characters will remain in their original form.
-
bool
has_text
(void) const¶
-
bool
is_wtext
(void) const¶ Returns true if any of the characters in the string returned by
get_wtext()
are out of the range of an ASCII character (and, therefore, get_wtext()
should be called in preference to get_text()
).
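The check described here amounts to asking whether any character falls outside the 7-bit ASCII range. A minimal sketch in plain Python, with the hypothetical helper name `needs_wide` and no dependency on TextEncoder:

```python
def needs_wide(text: str) -> bool:
    """True if any character lies outside the ASCII range (0 to 127)."""
    return any(ord(ch) > 127 for ch in text)


plain = needs_wide("hello")
accented = needs_wide("h\u00e9llo")
```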
-
std::string
lower
(std::string const &source)¶
-
std::string
lower
(std::string const &source, Encoding encoding)¶ Converts the string to lowercase. The one-argument form assumes the string is encoded in the default encoding; the two-argument form assumes the indicated encoding.
-
void
make_lower
(void)¶ Adjusts the text stored within the encoder to all lowercase letters (preserving accent marks correctly).
-
void
make_upper
(void)¶ Adjusts the text stored within the encoder to all uppercase letters (preserving accent marks correctly).
-
std::string
reencode_text
(std::string const &text, Encoding from, Encoding to)¶ Given the indicated text string, which is assumed to be encoded via the encoding “from”, decodes it and then reencodes it into the encoding “to”, and returns the newly encoded string. This does not change or affect any properties on the
TextEncoder
itself.
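The decode-then-reencode step can be sketched with Python's standard codecs. This is an illustration of the concept only; `reencode` is a hypothetical helper, and the codec names `latin-1`, `utf-8`, and `utf-16-be` are assumed stand-ins for E_iso8859, E_utf8, and E_utf16be.

```python
def reencode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from the `src` codec, then encode it in the `dst` codec."""
    return data.decode(src).encode(dst)


# "cafe" with an accented e, stored as one byte per character.
latin1_bytes = b"caf\xe9"
as_utf8 = reencode(latin1_bytes, "latin-1", "utf-8")
as_utf16 = reencode(as_utf8, "utf-8", "utf-16-be")
```

The input bytes are left untouched; each call returns a new byte string, which mirrors the note that the operation does not affect the encoder's own state.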
-
void
set_default_encoding
(TextEncoder::Encoding encoding)¶ Specifies the default encoding to be used for all subsequently created
TextEncoder
objects. See set_encoding()
.
-
void
set_encoding
(TextEncoder::Encoding encoding)¶ Specifies how the string set via
set_text()
is to be interpreted. The default, E_iso8859, means a standard string with one-byte characters (i.e. ASCII). Other encodings are possible to take advantage of character sets with more than 256 characters. This affects only future calls to
set_text()
; it does not change text that was set previously.
-
void
set_text
(PyObject *text)¶
-
void
set_text
(PyObject *text, Encoding encoding)¶ Changes the text that is stored in the encoder. The text should be encoded according to the method indicated by
set_encoding()
. Subsequent calls to get_text()
will return this same string, while get_wtext()
will return the decoded version of the string. The two-parameter version of set_text() accepts an explicit encoding; the text is immediately decoded and stored as a wide-character string. Subsequent calls to
get_text()
will return the same text re-encoded using whichever encoding is specified by set_encoding()
.
-
void
set_unicode_char
(std::size_t index, char32_t character)¶ Sets the Unicode value of the nth character in the stored text. This may be a wide character (greater than 255), after the string has been decoded according to
set_encoding()
.
-
void
set_wtext
(std::wstring const &wtext)¶ Direct support for wide-character strings. Now publishable with the new wstring support in interrogate.
Changes the text that is stored in the encoder. Subsequent calls to
get_wtext()
will return this same string, while get_text()
will return the encoded version of the string.
-
bool
unicode_isalpha
(char32_t character)¶ Returns true if the indicated character is an alphabetic letter, false otherwise. This is akin to ctype’s isalpha(), extended to Unicode.
-
bool
unicode_isdigit
(char32_t character)¶ Returns true if the indicated character is a numeric digit, false otherwise. This is akin to ctype’s isdigit(), extended to Unicode.
-
bool
unicode_islower
(char32_t character)¶ Returns true if the indicated character is a lowercase letter, false otherwise. This is akin to ctype’s islower(), extended to Unicode.
-
bool
unicode_ispunct
(char32_t character)¶ Returns true if the indicated character is a punctuation mark, false otherwise. This is akin to ctype’s ispunct(), extended to Unicode.
-
bool
unicode_isspace
(char32_t character)¶ Returns true if the indicated character is a whitespace character, false otherwise. This is akin to ctype’s isspace(), extended to Unicode.
-
bool
unicode_isupper
(char32_t character)¶ Returns true if the indicated character is an uppercase letter, false otherwise. This is akin to ctype’s isupper(), extended to Unicode.
-
int
unicode_tolower
(char32_t character)¶ Returns the lowercase equivalent of the given Unicode character. This is akin to ctype’s tolower(), extended to Unicode.
-
int
unicode_toupper
(char32_t character)¶ Returns the uppercase equivalent of the given Unicode character. This is akin to ctype’s toupper(), extended to Unicode.
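For comparison, Python's own string methods offer similar Unicode-aware classification. The sketch below uses them in place of the unicode_* functions, so results may differ from the library's tables for some characters. The `is_punct` helper is hypothetical and only approximates ispunct(), since it tests Unicode punctuation categories and excludes symbols such as `$` or `+`.

```python
import unicodedata


def is_punct(ch: str) -> bool:
    """True if the Unicode general category is a punctuation class (P*)."""
    return unicodedata.category(ch).startswith("P")


upper_e = "\u00c9"  # capital E with acute accent

checks = {
    "isalpha": upper_e.isalpha(),
    "isupper": upper_e.isupper(),
    "tolower": upper_e.lower(),
    "toupper": "\u00e9".upper(),
    "isdigit": "\u0663".isdigit(),   # Arabic-Indic digit three
    "isspace": "\u00a0".isspace(),   # no-break space
    "ispunct": is_punct("\u00bf"),   # inverted question mark
}
```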
-
std::string
upper
(std::string const &source)¶
-
enum