This article describes the user data types in Collection Pro and their behavior during import and export.
Text types
Text types are used to store text. Collection Pro supports text, text_oneline, string, text_l10n and text_l10n_oneline.
text, text_oneline
text and text_oneline store characters in one language. On the server, these types are treated the same; text_oneline is only a hint for frontends on how to display the data. So it is possible (over the API) to use newlines in a text_oneline column.
Text is stored as received, so a leading or trailing space is preserved. Frontends are requested to trim texts before sending them to the API.
Text must be encoded in UTF-8 and is stored in any normalization form. There is no limit to the length of the text that can be stored.
API
Text looks like this when sent and received over the API:
{
  "text": "this is the text"
}
The above example has the value this is the text for column text.
Index
Normalization is performed as part of the indexer document creation, where all text is run through an icu_normalizer.
In the indexer, text is stored using a custom analyzer icu_text, which works as follows:
{
  "analysis": {
    "icu_text": {
      "type": "custom",
      "tokenizer": "custom_icu_tokenizer",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    },
    "tokenizer": {
      "custom_icu_tokenizer": {
        "type": "pattern",
        "pattern": "([[^\\p{L}\\p{Digit}\\uD800-\\uDFFF\\u2F00-\\u2FDF]&&[^&%§\\$€]])"
      }
    }
  }
}
Text is normalized using the icu_normalizer and split into tokens using the above pattern.
What gets included in tokens:
- All alphabetic letters from any language.
- All numeric digits.
- Characters in the Unicode surrogate pair range and Kangxi Radicals.
- The symbols &, %, §, $ and €.
What causes token separation:
- Punctuation marks (except the specified symbols).
- Whitespace characters.
- Other symbols and control characters not specified.
Tokens are then turned into terms using the icu_folding token filter. The filter folds away diacritics and punctuation and turns all characters into lower case, so the token Bär is stored as bar.
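As an illustration, the following Go sketch approximates the two-step analysis described above. The regular expression is a simplified stand-in for the custom_icu_tokenizer pattern (it omits the surrogate and Kangxi Radical ranges), and the folding uses golang.org/x/text rather than ICU, so edge cases may differ from the real analyzer.

package main

import (
    "fmt"
    "regexp"
    "strings"
    "unicode"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

// tokenSplit approximates custom_icu_tokenizer: split on anything that is
// neither a letter, a digit, nor one of the kept symbols & % § $ €.
var tokenSplit = regexp.MustCompile(`[^\p{L}\p{Nd}&%§$€]+`)

// fold approximates icu_folding: strip diacritics and lowercase.
func fold(token string) string {
    t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
    folded, _, err := transform.String(t, token)
    if err != nil {
        folded = token
    }
    return strings.ToLower(folded)
}

func main() {
    for _, tok := range tokenSplit.Split("Der Bär kostet 100€!", -1) {
        if tok != "" {
            fmt.Println(fold(tok)) // der, bar, kostet, 100€
        }
    }
}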
Using the API, searches for text can be performed either on the analyzed value (matching the terms) or on the unanalyzed value, which is stored alongside the terms. The unanalyzed value stores the data as is; no normalization takes place.
All text for indexed documents is split into chunks of 8,000 UTF-8 characters. When matching full texts in analyzed form, texts that exceed 8,000 characters cannot easily be matched.
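A minimal sketch of that chunking, counting characters (runes) rather than bytes; only the 8,000 limit is taken from the text above, the helper itself is illustrative.

// chunkRunes splits s into pieces of at most size characters,
// mirroring the 8,000-character chunking described above.
func chunkRunes(s string, size int) []string {
    var chunks []string
    r := []rune(s)
    for start := 0; start < len(r); start += size {
        end := start + size
        if end > len(r) {
            end = len(r)
        }
        chunks = append(chunks, string(r[start:end]))
    }
    return chunks
}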
Sorting
Sort strings are compiled using the Go library collate. It assumes the text is in the first configured database language. Numbers are recognized, so that Car 100 sorts after Car 11. Text is normalized by the collate library. Internally, we use the hex representation of that string to work around anomalies in Elasticsearch. Some special replacement is always done [TODO: link].
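A sketch of how such sort keys can be built with golang.org/x/text/collate, assuming de-DE as the first configured database language; the Numeric option gives the Car 11 / Car 100 ordering, and the hex encoding mirrors the Elasticsearch workaround mentioned above.

package main

import (
    "encoding/hex"
    "fmt"
    "sort"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

func main() {
    // Assumption: "de-DE" is the first configured database language.
    c := collate.New(language.MustParse("de-DE"), collate.Numeric)

    titles := []string{"Car 100", "Car 11", "Car 2"}
    sort.Slice(titles, func(i, j int) bool {
        return c.CompareString(titles[i], titles[j]) < 0
    })
    fmt.Println(titles) // [Car 2 Car 11 Car 100]

    // Hex representation of the collation key, as used for the index.
    var buf collate.Buffer
    fmt.Println(hex.EncodeToString(c.KeyFromString(&buf, "Car 100")))
}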
Export
Text is exported as is, keeping spaces and normalization.
The XML export looks like this for a column named title with the value Title and the type text_oneline. The column-api-id in this example is 29.
<title type="text_oneline" column-api-id="29">Title</title>
For the type text, the type attribute is text.
Output in CSV is as is, same for JSON.
string
The string type’s main difference to the text type is how it is indexed. It is recommended to use string types for identification strings which may contain special characters that would be dropped by the analyzer.
API
String looks like this when sent and received over the API:
{
  "ref": "A$5667"
}
The above example has the value A$5667 for the column ref.
Index
String values are normalized and lowercased for the index document.
{
  "analyzer": {
    "keyword_lowercase": {
      "tokenizer": "keyword",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    }
  }
}
Strings are normalized using the icu_normalizer and converted to lower case using the icu_folding token filter.
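To illustrate the difference from the text analyzer, here is a rough sketch (same simplified pattern as in the text example, diacritic folding omitted) of how an identifier such as Hall/7$ comes out of each chain: the text analyzer splits it at the slash, while keyword_lowercase keeps it as a single lowercase term.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    id := "Hall/7$"

    // Approximation of the text analyzer: "/" is not in the kept symbol
    // set, so the value is split into the terms "hall" and "7$".
    tokenSplit := regexp.MustCompile(`[^\p{L}\p{Nd}&%§$€]+`)
    fmt.Println(tokenSplit.Split(strings.ToLower(id), -1)) // [hall 7$]

    // Approximation of keyword_lowercase: the whole value stays one term.
    fmt.Println(strings.ToLower(id)) // hall/7$
}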
All strings for indexed documents are split into chunks of 8,000 UTF-8 characters. When matching full strings in analyzed form, strings that exceed 8,000 characters cannot easily be matched.
Sorting
The sorting of string values works like for text. In addition to the text sorting, a pure alphanumerical version is stored in the index alongside the numerically sortable variant. With that, sorting yields the order Car 10, Car 11, Car 12, Car 100. Some special replacement is always done.
Export
The XML export looks like the one for text.
<ref type="string" column-api-id="346">hall/7$</ref>
In this example, the column ref is exported with the value hall/7$.
The CSV and JSON export the string as is.
text_l10n, text_l10n_oneline
The types text_l10n and text_l10n_oneline are designed to store localized values. The format is a JSON object consisting of the language as key and the text as value.
{
  "title_loca": {
    "fi-FI": "Finnish",
    "en-US": "English"
  }
}
The API does not check the languages entered, so loading and saving languages that are not configured in the database is supported.
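A minimal sketch of how such a localized value round-trips as plain JSON, assuming a map from language code to text; the added xx-XX key illustrates that unchecked languages are preserved.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // A text_l10n value is a JSON object mapping language codes to text.
    raw := []byte(`{"title_loca": {"fi-FI": "Finnish", "en-US": "English"}}`)

    var fields map[string]map[string]string
    if err := json.Unmarshal(raw, &fields); err != nil {
        panic(err)
    }

    // The API does not validate the language keys, so a language that is
    // not configured in the database round-trips unchanged.
    fields["title_loca"]["xx-XX"] = "Unspecified language"

    out, _ := json.Marshal(fields)
    fmt.Println(string(out))
}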
Index
Indexing is done the same way the text format is indexed. Only enabled database languages are mapped into the index; other languages are ignored. After changing the settings for the database languages, a reindex is required.
Sorting
Sorting is performed using a collate string produced by the Go library collate. The language key is parsed as a BCP 47 string and passed to the library. Some special replacement is done:
'ʾ': '@', // 02BE Hamza (open to the front), @ sorts this before A
'ʿ': '@', // 02BF Ayn (open to the back), @ sorts this before A
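A sketch of that pipeline, assuming the replacement is applied before the collation key is built; sortKey and the two-entry replacement table are illustrative, not the server's actual code.

package main

import (
    "encoding/hex"
    "fmt"
    "strings"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

// replacer applies the special replacements listed above.
var replacer = strings.NewReplacer(
    "\u02BE", "@", // Hamza, sorts before A
    "\u02BF", "@", // Ayn, sorts before A
)

// sortKey builds a hex-encoded collation key for one language entry.
func sortKey(lang, text string) (string, error) {
    tag, err := language.Parse(lang) // e.g. "fi-FI", parsed as BCP 47
    if err != nil {
        return "", err
    }
    var buf collate.Buffer
    key := collate.New(tag).KeyFromString(&buf, replacer.Replace(text))
    return hex.EncodeToString(key), nil
}

func main() {
    key, err := sortKey("fi-FI", "ʾAlif")
    if err != nil {
        panic(err)
    }
    fmt.Println(key)
}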
Export
XML exported data looks like this:
<title_loca type="text_l10n" column-api-id="81">
  <fi-FI>Finnish</fi-FI>
  <en-US>English</en-US>
</title_loca>
The example shows the XML snippet for a column title_loca with the type text_l10n.
In CSV, the values are exported like this:
| title_loca.fi-FI | title_loca.en-US |
|------------------|------------------|
| Finnish          | English          |
For each language exported, a suffix .<code> is added to the column name.
Number types
Number types are used to store numbers. Collection Pro supports number, double and integer.
number
The type number stores integers between -(2**53)+1 and (2**53)-1. This follows the recommendation in RFC 8259. The range is between -9,007,199,254,740,991 and 9,007,199,254,740,991 (inclusive).
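A small sketch of a range check for that limit; inNumberRange is an illustrative helper, the bounds are the ones given above.

// maxSafeJSONInt is (2**53)-1, the largest integer RFC 8259 recommends
// for interoperable JSON numbers.
const maxSafeJSONInt = int64(1)<<53 - 1

// inNumberRange reports whether n fits into the number type, i.e. lies
// between -(2**53)+1 and (2**53)-1 inclusive.
func inNumberRange(n int64) bool {
    return n >= -maxSafeJSONInt && n <= maxSafeJSONInt
}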
Index
The index stores the number as long.
Sorting
The sorting is independent of the requested search language.
Export
The XML export looks like this:
<number type="number" column-api-id="10">1234</number>
This shows the example export for the column number and a value of 1234.
CSV and JSON export the number as is.
Other types
Collection Pro supports boolean, file and geojson as other types.
boolean
The type boolean can be used to store two states, false and true. The default is false.
Index
The indexed document contains an entry with false or true. It is mapped as type boolean.
Sorting
The ascending sort order is false, true. The sorting is independent of the requested search language.
Export
The XML representation looks like this:
<bool type="boolean" column-api-id="2">true</bool>
This is for a column bool and the value true. A value of false is also always rendered.
The CSV representation is true or false, respectively.
The JSON representation is a JSON boolean.
The storage inside the server distinguishes between null and false, but this is not visible over the API.