User Data types

This article describes the various user data types in Collection Pro and their behavior during import and export situations.

Text types

Text types are used to store text. Collection Pro supports text, text_oneline, string, text_l10n and text_l10n_oneline.

text, text_oneline

text and text_oneline store characters in one language. On the server, these types are treated the same; text_oneline is only a hint for frontends on how to display the data. So it is possible (over the API) to use newlines in a text_oneline column.

Text is stored as received, so leading or trailing spaces are preserved. Frontends are requested to trim texts before sending them to the API.

Text must be encoded in UTF-8 and is stored in any normalization form. There is no limit to the length of the text that can be stored.

API

Text looks like this when sent and received over the API:

{
    "text": "this is the text"
}

The above example has the value this is the text for column text.

Index

Normalization is performed as part of indexer document creation, where all text is run through an icu_normalizer.

In the indexer, text is stored using a custom analyzer icu_text which works as follows:

{
  "analysis": {
    "analyzer": {
      "icu_text": {
        "type": "custom",
        "tokenizer": "custom_icu_tokenizer",
        "filter": [
          "icu_folding"
        ],
        "char_filter": [
          "icu_normalizer"
        ]
      }
    },
    "tokenizer": {
      "custom_icu_tokenizer": {
        "type": "pattern",
        "pattern": "([[^\\p{L}\\p{Digit}\\uD800-\\uDFFF\\u2F00-\\u2FDF]&&[^&%§\\$€]])"
      }
    }
  }
}


Text is normalized using the icu_normalizer and tokenized into tokens using the above pattern.

What gets included in tokens:

  • All alphabetic letters from any language.
  • All numeric digits.
  • Characters in the Unicode surrogate pair range and Kangxi Radicals.
  • Symbols: &, %, §, $, and €.

What causes token separation:

  • Punctuation marks (except the specified symbols).
  • Whitespace characters.
  • Other symbols and control characters not specified.
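The tokenizer pattern above uses ICU/Java character-class intersection, which Python's re module does not support, but the splitting behavior can be approximated. A rough sketch (note that \w only stands in for \p{L}\p{Digit} and also keeps underscores):

```python
import re

def tokenize(text):
    # Rough approximation of the custom_icu_tokenizer: split on any
    # character that is neither alphanumeric nor one of the kept
    # symbols & % § $ €.
    return [t for t in re.split(r"[^\w&%§$€]+", text) if t]

tokenize("Car 100, price: $50 & more!")
# → ['Car', '100', 'price', '$50', '&', 'more']
```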

Tokens are then turned into terms using the icu_folding token filter. The filter removes diacritics and other marks and turns all characters into lower case, so the token Bär is stored as bar.
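The folding step can be approximated in Python with Unicode decomposition. This is only a sketch; the real icu_folding filter applies many additional character foldings:

```python
import unicodedata

def fold(token):
    # Approximate icu_folding: decompose (NFD), drop combining
    # marks (diacritics), then lowercase.
    nfd = unicodedata.normalize("NFD", token)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

fold("Bär")  # → 'bar'
```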

Using the API, searches for text can be performed either on the analyzed value (matching the terms) or on the unanalyzed value, which is stored alongside the terms. The unanalyzed value stores the data as is; no normalization takes place.

All text for indexed documents is split into chunks of 8000 UTF-8 characters. When matching full texts in analyzed form, texts cannot easily be matched if they exceed 8000 characters.
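The chunking can be sketched as follows (splitting on code points; whether the server counts bytes or characters at chunk boundaries is not specified here):

```python
def chunk_text(text, size=8000):
    # Split a long text into fixed-size chunks for indexing.
    return [text[i:i + size] for i in range(0, len(text), size)]

len(chunk_text("x" * 20000))  # → 3 chunks of 8000, 8000 and 4000 characters
```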

Sorting

Sort strings are compiled using the Go library collate. It assumes the text is in the first configured database language. Numbers are recognized, so that Car 100 sorts after Car 11. Text is normalized by the collate library. Internally, we use the hex representation of that string to work around anomalies in Elasticsearch. Some special replacement is always done [TODO: link].
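The number recognition can be illustrated with a simplified sort key. The real implementation uses the Go collate library with language-aware collation; this Python sketch only shows the numeric behavior:

```python
import re

def natural_key(s):
    # Split into digit and non-digit runs so numbers compare
    # numerically: "Car 100" then sorts after "Car 11".
    return [int(p) if p.isdigit() else p.casefold()
            for p in re.split(r"(\d+)", s)]

sorted(["Car 100", "Car 11"], key=natural_key)  # → ['Car 11', 'Car 100']
```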

Export

Text is exported as is, preserving spaces and normalization.

The XML export looks like this for a column named title with the value Title and the type text_oneline. The column-api-id in this example is 29.

<title type="text_oneline" column-api-id="29">Title</title>

For the type text, the type attribute is text.

CSV and JSON output the text as is.


string

The string type’s main difference from the text type is how it is indexed. It is recommended to use string for identification strings that may contain special characters which would be dropped by the text analyzer.

API

String looks like this when sent and received over the API:

{
    "ref": "A$5667"
}

The above example has the value A$5667 for column ref.

Index

String values are normalized and lowercased for the index document.

{
  "analyzer": {
    "keyword_lowercase": {
      "tokenizer": "keyword",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    }
  }
}


Strings are normalized using the icu_normalizer and converted to lower case using the icu_folding token filter.
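The difference to the text analyzer is that the whole value stays a single token. A minimal Python approximation of the effect:

```python
import unicodedata

def keyword_lowercase(value):
    # The whole value is one token: normalize (NFD), strip combining
    # marks, lowercase. No tokenization, so special characters like
    # $ and / are kept.
    nfd = unicodedata.normalize("NFD", value)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

keyword_lowercase("A$5667")  # → 'a$5667'
```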

All strings for indexed documents are split into chunks of 8000 UTF-8 characters. When matching full strings in analyzed form, strings cannot easily be matched if they exceed 8000 characters.

Sorting

The sorting of string values works like for text. In addition, a pure alphanumerical version is stored in the index alongside the numerically sortable variant. With that, sorting can order Car 10, Car 11, Car 12, Car 100. Some special replacement is always done.
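The two variants can be contrasted in a short sketch (simplified; the real sort keys come from the collate library):

```python
import re

names = ["Car 100", "Car 11", "Car 12", "Car 10"]

# Pure alphanumerical variant: plain string comparison.
sorted(names)
# → ['Car 10', 'Car 100', 'Car 11', 'Car 12']

# Numerically sortable variant: digit runs compare as numbers.
def numeric_key(s):
    return [int(p) if p.isdigit() else p.casefold()
            for p in re.split(r"(\d+)", s)]

sorted(names, key=numeric_key)
# → ['Car 10', 'Car 11', 'Car 12', 'Car 100']
```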

Export

The XML export looks like the one for text.

<ref type="string" column-api-id="346">hall/7$</ref>


In this example, the column ref is exported with the value hall/7$.

The CSV and JSON export the string as is.


text_l10n, text_l10n_oneline

The types text_l10n and text_l10n_oneline are designed to store localized values. The format is a JSON object consisting of the language as key and the text as value.

{
  "title_loca": {
    "fi-FI": "Finnish",
    "en-US": "English"
  }
}


The API does not validate the language keys, so loading and saving values for languages not configured in the database is supported.
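A client typically has to pick one language out of such an object. A hypothetical helper (the function name and fallback order are illustrative, not part of the API):

```python
def pick_localization(value, preferred, fallback="en-US"):
    # Pick the preferred language from a text_l10n object, falling
    # back to another language, then to any available entry.
    for lang in (preferred, fallback):
        if value.get(lang):
            return value[lang]
    return next(iter(value.values()), "")

title = {"fi-FI": "Finnish", "en-US": "English"}
pick_localization(title, "de-DE")  # → 'English'
```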

Index

Indexing is done the same way as for the text type. Only enabled database languages are mapped into the index; other languages are ignored. After changing the database language settings, a reindex is required.

Sorting

Sorting is performed using a collate string produced by the Go library collate. The language is parsed as BCP 47 string and passed to the library. Some special replacement is done:

'ʾ': '@', //     02BE    Hamza (open at the front) @ sorts this before A
'ʿ': '@', //     02BF    Ayn (open at the back) @ sorts this before A

Export

XML exported data looks like this:

<title_loca type="text_l10n" column-api-id="81">
  <fi-FI>Finnish</fi-FI>
  <en-US>English</en-US>
</title_loca>


The example shows the XML snippet for a column title_loca with the type text_l10n.

In CSV, the values are exported like this:

title_loca.fi-FI,title_loca.en-US
Finnish,English


For each language exported, a suffix .<code> is added to the column name.
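The column naming scheme can be sketched as follows (the helper name is illustrative):

```python
def l10n_csv_columns(column, value):
    # Expand a text_l10n value into one CSV column per language,
    # suffixing the column name with .<code>.
    return {f"{column}.{lang}": text for lang, text in value.items()}

l10n_csv_columns("title_loca", {"fi-FI": "Finnish", "en-US": "English"})
# → {'title_loca.fi-FI': 'Finnish', 'title_loca.en-US': 'English'}
```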


Number types

Number types are used to store numbers. Collection Pro supports number, double and integer.

number

The type number stores integers between -(2**53)+1 and (2**53)-1. This follows the recommendation in RFC 8259. The range is between -9,007,199,254,740,991 and 9,007,199,254,740,991 (inclusive).
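A range check for this type can be written directly from the definition:

```python
MAX_SAFE = 2**53 - 1  # 9,007,199,254,740,991

def is_valid_number(n):
    # Range check for the number type, following RFC 8259's
    # interoperable integer range.
    return -MAX_SAFE <= n <= MAX_SAFE

is_valid_number(9007199254740991)  # → True
is_valid_number(9007199254740992)  # → False
```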

Index

The index stores the number as long.

Sorting

The sorting is independent of the requested search language.

Export

The XML Export looks like this:

<number type="number" column-api-id="10">1234</number>


This shows the example export for column number and a value of 1234.

CSV and JSON export the number as is.

Other types

Collection Pro supports boolean, file and geojson as other types.

boolean

The type boolean can be used to store two states, false and true. The default is false.

Index

The indexed document contains an entry with false or true. It is mapped as type boolean.

Sorting

The ascending sort order is false, true. The sorting is independent of the requested search language.

Export

The XML representation looks like this:

<bool type="boolean" column-api-id="2">true</bool>


This is for a column bool with the value true. A false value is also always rendered.

The CSV representation is false or true, respectively.

The JSON representation is a JSON boolean.

The storage inside the server distinguishes between null and false, but this is not visible over the API.

