User Data types

This article describes the various user data types in Collection Pro and their behavior during import and export situations.

Text types

Text types are used to store text. Collection Pro supports text, text_oneline, string, text_l10n and text_l10n_oneline.

text, text_oneline

text and text_oneline store characters in one language. On the server, these types are treated the same; text_oneline is only a hint for frontends on how to display the data. So it is possible (via the API) to store newlines in a text_oneline column.

Text is stored as received, so leading or trailing spaces are preserved. Frontends are requested to trim texts before sending them to the API.

Text must be encoded in UTF-8; all Unicode normalization forms are accepted and stored as is. There is no limit to the length of the text that can be stored.

API

Text looks like this when sent and received over the API:

{
    "text": "this is the text"
}

The above example has the value this is the text for column text.

Index

Normalization is performed as part of the index document creation, where all text is run through an icu_normalizer.

In the indexer, text is stored using a custom analyzer icu_text which works as follows:

{
  "analysis": {
    "analyzer": {
      "icu_text": {
        "type": "custom",
        "tokenizer": "custom_icu_tokenizer",
        "filter": [
          "icu_folding"
        ],
        "char_filter": [
          "icu_normalizer"
        ]
      }
    },
    "tokenizer": {
      "custom_icu_tokenizer": {
        "type": "pattern",
        "pattern": "([[^\\p{L}\\p{Digit}\\uD800-\\uDFFF\\u2F00-\\u2FDF]&&[^&%§\\$€]])"
      }
    }
  }
}


Text is normalized using the icu_normalizer and tokenized into tokens using the above pattern.

What gets included in tokens:

  • All alphabetic letters from any language.
  • All numeric digits.
  • Characters in the Unicode surrogate pair range and Kangxi Radicals.
  • Symbols: &, %, §, $, and €.

What causes token separation:

  • Punctuation marks (except the specified symbols).
  • Whitespace characters.
  • Other symbols and control characters not specified.

Tokens are then turned into terms using the icu_folding token filter. The filter applies Unicode folding, which removes diacritics and turns all characters into lower case. So the token Bär is stored as bar.
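The combined effect of the tokenizer and the folding filter can be sketched in Python. This is only an approximation, not the actual ICU implementation: `\w` stands in for the letter/digit classes of the pattern above, and a simple NFD decomposition stands in for icu_folding.

```python
import re
import unicodedata

# Approximation of custom_icu_tokenizer: split on any run of characters
# that are neither letters/digits nor one of & % § $ €.
# (\w is a rough stand-in for \p{L}\p{Digit}; it also matches "_".)
SPLIT = re.compile(r"[^\w&%§$€]+")

def tokenize(text: str) -> list[str]:
    return [t for t in SPLIT.split(text) if t]

def fold(token: str) -> str:
    # Rough stand-in for icu_folding: decompose, drop combining marks
    # (diacritics), then lower-case.
    nfd = unicodedata.normalize("NFD", token)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

terms = [fold(t) for t in tokenize("Bär & Co. 100%")]
# → ["bar", "&", "co", "100%"]
```

Note how the period after Co separates tokens, while & and % survive inside terms.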

Using the API, searches for text can be performed either on the analyzed value (matching the terms), or on the unanalyzed value, which is stored alongside the terms. The unanalyzed value stores the data as is; no normalization takes place.

All text for indexed documents is split into chunks of 8000 UTF-8 characters. When matching full texts in analyzed form, texts cannot easily be matched if they exceed 8000 characters.

Sorting

Sort strings are compiled using the Go library collate, which assumes the text is in the first configured database language. Numbers are recognized so that Car 100 sorts after Car 11. Text is normalized by the collate library. Internally, we use the hex representation of that string to work around anomalies in Elasticsearch. Some special replacement is always done [TODO LINK].
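The numeric recognition can be illustrated with a simplified sort key in Python. This is a sketch of the behavior only; the server uses Go's collate package, which additionally handles language-specific collation.

```python
import re

def natural_key(s: str):
    # Split into digit and non-digit runs; digit runs compare as numbers,
    # everything else compares case-insensitively.
    return [int(p) if p.isdigit() else p.lower() for p in re.split(r"(\d+)", s)]

sorted(["Car 100", "Car 11", "Car 2"], key=natural_key)
# → ["Car 2", "Car 11", "Car 100"]
```

A plain string sort would instead yield Car 100, Car 11, Car 2.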

Export

Text is exported as is, keeping spaces & normalization.

The XML export looks like this for a column named title with a value Title for the type text_oneline. The column-api-id in this example is 29.

<title type="text_oneline" column-api-id="29">Title</title>

For the type text, the type attribute is text.

CSV and JSON output the text as is.


string

The string type’s main difference to the text type is how it’s indexed. It is recommended to use string types for identification strings which may contain special characters which would be dropped by the analyzer.

API

String looks like this when sent and received over the API:

{
    "ref": "A$5667"
}

The above example has the value A$5667 for column ref.

Index

String values are normalized and lowercased for the index document.

{
  "analyzer": {
    "keyword_lowercase": {
      "tokenizer": "keyword",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    }
  }
}


Strings are normalized using the icu_normalizer and converted to lower case using the icu_folding token filter.
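In contrast to the text analyzer, the whole value stays a single term; only normalization, folding and lowercasing are applied. A rough Python sketch of the effect (not the actual ICU implementation):

```python
import unicodedata

def keyword_term(value: str) -> str:
    # The entire value remains one token; it is only normalized,
    # stripped of diacritics and lower-cased.
    nfd = unicodedata.normalize("NFD", value)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()

keyword_term("A$5667")   # "a$5667" — the "$" survives
keyword_term("Hall/7$")  # "hall/7$"
```

This is why string is recommended for identifiers: characters like $ or / that the text analyzer would split on are kept intact.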

All strings for indexed documents are split into chunks of 8000 UTF-8 characters. When matching full strings in analyzed form, strings cannot easily be matched if they exceed 8000 characters.

Sorting

The sorting of string values works like for text. In addition to the text sorting, a pure alphanumerical version is stored in the index alongside the numerically sortable variant. With that, sorting can produce Car 10, Car 11, Car 12, Car 100. Some special replacement is always done.

Export

The XML export looks like it does for text.

<ref type="string" column-api-id="346">hall/7$</ref>


In this example, the column ref is exported with the value hall/7$.

The CSV and JSON export the string as is.


text_l10n, text_l10n_oneline

The types text_l10n and text_l10n_oneline are designed to store localized values. The format is a JSON object consisting of the language as key and the text as value.

API

Localized text looks like this when sent and received over the API:

{
  "title_loca": {
    "fi-FI": "Finnish",
    "en-US": "English"
  }
}

The above example sets the Finnish value Finnish and the English value English for column title_loca.
The API does not check the language entered, so loading and saving an unspecified database language is supported.

Index

Indexing is done the same way the text format is indexed. Only enabled database languages are mapped into the index; other languages are ignored. After changing the database language settings, a reindex is required.

Sorting

Sorting is performed using a collate string produced by the Go library collate. The language is parsed as a BCP 47 string and passed to the library. Some special replacement is done:

'ʾ': '@', //     02BE    Hamza (open at the front); @ sorts this before A
'ʿ': '@', //     02BF    Ayn (open at the back); @ sorts this before A
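The replacement can be pictured as a simple character map applied before the collate string is built. A sketch with a hypothetical helper name:

```python
# Hamza (U+02BE) and Ayn (U+02BF) are mapped to "@" so they sort before "A".
REPLACEMENTS = {"\u02be": "@", "\u02bf": "@"}

def apply_replacements(s: str) -> str:
    return "".join(REPLACEMENTS.get(c, c) for c in s)

apply_replacements("\u02beAlif")  # "@Alif"
```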

Export

XML exported data looks like this:

<title_loca type="text_l10n" column-api-id="81">
  <fi-FI>Finnish</fi-FI>
  <en-US>English</en-US>
</title_loca>


The example shows the XML snippet for a column title_loca with the type text_l10n.

In CSV, the values are exported like this:

title_loca.fi-FI,title_loca.en-US
Finnish,English


For each language exported, a suffix .<code> is added to the column name.
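The column naming can be sketched as follows (hypothetical helper, not server code):

```python
def flatten_l10n(column: str, values: dict[str, str]) -> dict[str, str]:
    # Each language gets its own CSV column, suffixed with ".<code>".
    return {f"{column}.{code}": text for code, text in values.items()}

flatten_l10n("title_loca", {"fi-FI": "Finnish", "en-US": "English"})
# → {"title_loca.fi-FI": "Finnish", "title_loca.en-US": "English"}
```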


Number types

Number types are used to store numbers. Collection Pro supports number, double and integer.2.

number

The type number stores integers between -(2**53)+1 and (2**53)-1. This follows the recommendation in RFC 8259. The range is between -9,007,199,254,740,991 and 9,007,199,254,740,991 (inclusive).
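The limits can be checked with a small helper, sketched here from the RFC 8259 interoperability range (integers exactly representable as IEEE 754 doubles):

```python
# Integers safely representable in JSON without loss (RFC 8259).
JSON_MAX_INT = 2**53 - 1      #  9,007,199,254,740,991
JSON_MIN_INT = -(2**53) + 1   # -9,007,199,254,740,991

def in_number_range(n: int) -> bool:
    return JSON_MIN_INT <= n <= JSON_MAX_INT

in_number_range(9007199254740991)  # True
in_number_range(2**53)             # False
```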

API

Number looks like this when sent and received over the API:

{
    "number": 1234
}

The above example has the value 1234 for column number.

Index

The index stores the number as long.

Sorting

The sorting is independent of the requested search language.

Export

The XML Export looks like this:

<number type="number" column-api-id="10">1234</number>


This shows the example export for column number and a value of 1234.

CSV and JSON export the number as is.

integer.2

The integer.2 type is like a currency type. The value is stored with a 2-digit fraction without loss. Internally, the server stores this as an integer. Representations of the value in export and standard info display the fraction. For example, the value 1234 will be displayed and rendered as 12.34. The limits for this type are the same as for number.
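The rendering of the stored integer as a 2-digit fraction can be sketched like this (hypothetical helper, not server code):

```python
def render_integer_2(stored: int) -> str:
    # The stored integer holds the value in hundredths: 1234 → "12.34".
    sign = "-" if stored < 0 else ""
    units, cents = divmod(abs(stored), 100)
    return f"{sign}{units}.{cents:02d}"

render_integer_2(567)   # "5.67"
render_integer_2(1234)  # "12.34"
```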

API

Number looks like this when sent and received over the API:

{
    "integer_2": 567
}

The above example has the value 5.67 for column integer_2.

Index

The index stores the integer.2 as long.

Sorting

The sorting is independent of the requested search language.

Export

The XML Export looks like this:

<integer_2 type="integer.2" column-api-id="4">5.67</integer_2>


This shows the example export for column integer_2 and a value of 5.67.

CSV export uses 5.67 whereas the JSON exports the value as 567.

double

The type double stores and loads a 64-bit floating point number (Go type float64). The range for this type is approximately -1.7976931348623157e+308 to 1.7976931348623157e+308. Go adheres to IEEE 754 for storing the values.

API

Number looks like this when sent and received over the API:

{
    "double": 1234.5678
}

Numbers are guaranteed to keep their value, but the textual representation may change: the JSON parser may turn 1e4 into 10000 upon load.
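This behavior is easy to reproduce with any JSON parser, for example in Python:

```python
import json

# The numeric value is preserved, but the textual form is not:
# "1e4" and "10000" parse to the same number.
json.loads("1e4") == 10000.0          # True
json.loads("1e4") == json.loads("10000")  # True
```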

Index

The type double is stored as type double in the indexer.

Sorting

The sorting is independent of the requested search language.

Export

The XML Export looks like this:

<double type="double" column-api-id="2">1234.5678</double>


This shows the example export for column double and a value of 1234.5678.

CSV and JSON export the number as is.


Date types

The date types store dates with or without time, as well as date ranges. The server supports dates where only the year is given, or only the year and month. The types are date, date+time and daterange.

date, date+time

The types date and date+time are used to load and store dates and dates with times as defined in ISO 8601. Note that the server doesn't support a type to store a duration, just a time, or a date without a year.

The server does support storing just the calendar year or year and month. These dates are stored as a date range internally, so e.g. 2000 would be stored as the range from 2000-01-01 00:00:00 to 2000-12-31 23:59:59.
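The internal expansion of a bare year into a range can be sketched as:

```python
from datetime import datetime

def year_to_range(year: int) -> tuple[datetime, datetime]:
    # A bare year like 2000 is stored as the full range it covers.
    return (datetime(year, 1, 1, 0, 0, 0),
            datetime(year, 12, 31, 23, 59, 59))

year_to_range(2000)
# → (2000-01-01 00:00:00, 2000-12-31 23:59:59)
```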

The supported range for dates is between years -292,277,022,399 and 292,277,022,399. This is roughly UNIX seconds stored in a 64-bit integer.

API

The API treats the types date and date+time the same. So it is possible to store a date+time in a date value. The type is merely a hint for frontends to render the value accordingly.

The longest possible form of the value contains the timezone as shown in this example:

{
  "date": {
    "value": "2010-12-10T12:45:00+01:00:00"
  }
}

This stores the date 2010-12-10 with time 12:45:00 in time zone +01:00:00. Time zones must be given using the offset. The server keeps the time zone. If a value without a time zone is given, the server assumes UTC as time zone.

The shortest possible date value looks like this:

{
  "date": {
    "value": "2010"
  }
}

This has the date 2010 spanning a range from 2010-01-01 00:00:00 to 2010-12-31 23:59:59.

The API can parse a wide variety of formats. This includes parsing fractional seconds. Fractional seconds are discarded.

The output is always formatted as follows:

YEAR              = "2006"
YEAR_MONTH        = "2006-01"
DATE              = "2006-01-02"
DATE_TIME         = "2006-01-02T15:04"
DATE_TIME_SEC     = "2006-01-02T15:04:05"
DATE_TIME_SEC_TZ  = "2006-01-02T15:04:05Z07:00"
DATE_TIME_SEC_TZS = "2006-01-02T15:04:05Z07:00:00"

The left side reflects the recognized width of the value.
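Detecting the width by trying the formats from shortest to longest can be sketched in Python, with strftime formats replacing the Go reference layouts above (a sketch with a hypothetical helper; the time zone variants are omitted for brevity):

```python
from datetime import datetime

# strftime equivalents of the widths above (time zone variants omitted).
WIDTHS = [
    ("%Y", "YEAR"),
    ("%Y-%m", "YEAR_MONTH"),
    ("%Y-%m-%d", "DATE"),
    ("%Y-%m-%dT%H:%M", "DATE_TIME"),
    ("%Y-%m-%dT%H:%M:%S", "DATE_TIME_SEC"),
]

def detect_width(value):
    for fmt, name in WIDTHS:
        try:
            datetime.strptime(value, fmt)
            return name
        except ValueError:
            pass
    return None

detect_width("2010")              # "YEAR"
detect_width("2010-12")           # "YEAR_MONTH"
detect_width("2010-12-10T12:45")  # "DATE_TIME"
```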

Note: The parser does not support sending a timezone for shortened date strings. For shorter date strings, the stored time zone is always UTC.

Index

The server calculates all dates to index based on the UNIX seconds computed for the UTC time zone of the given date. The values are stored as long in the indexer. Various values are calculated to reflect the range every date creates. Alongside the from and to values, a from_to value is stored which allows for aggregations using the middle value of a date.

Sorting

Sorting can be done by using the stored subfield from, to or from_to (default). Using the API, sort values can be adjusted to match a specific date width (day, week, month, year). With that, values spanning two years in UTC (e.g. New Year's Eve in Sydney vs. Los Angeles) can be adjusted to be sorted by the same value using width year.

Sort values can also be adjusted to a specific timezone using the API.

Export

The XML contains the date rendered according to the stored width of the date.

<date type="date" column-api-id="2">2010-12-10T12:45:00+01:00:00</date>

The example shows the output for a column date with date value 2010-12-10T12:45:00+01:00:00.

The same value is used for the export in CSV and JSON.

daterange

The daterange type stores a lower and upper value of a date range. Either one can be omitted to store an open range.

API

The API uses an object with from and to for the daterange type:

{
  "daterange": {
    "from": "2001",
    "to": "2010",
    "text": {
      "fi-FI": "Uuden vuosituhannen ensimmäinen vuosikymmen",
      "en-US": "The first decade in the new millennium"
    }
  }
}

Parsing of the dates works the same as for the date, date+time type. Thus, a daterange can store the full set of widths available for the dates in fylr. The shortest date is the year and the longest the full time including a timezone with seconds.

Alongside with the date values, a textual value for the daterange can be set. The format of the text property matches the format of the text_l10n type.

Index

The index works the same as for the date, date+time type. Internally, the server always works with date ranges to store values.

Alongside the from and to values, a from_to value is stored which allows for aggregations using the middle value of a date. To aggregate over the range, the server additionally stores a range value which consists of all values between the lower and upper limit. The server stores all years, months and days found between the lower and upper value of the range, depending on the width. If the date has the width year, all years in between are stored for aggregations, e.g. for the above example the dates 2001-01-01 00:00:00, 2002-01-01 00:00:00, ..., 2010-01-01 00:00:00.
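For a year-width range, the stored aggregation values can be sketched as (hypothetical helper):

```python
def range_aggregation_years(from_year: int, to_year: int) -> list[str]:
    # One entry per year between the limits, at the start of each year.
    return [f"{y}-01-01 00:00:00" for y in range(from_year, to_year + 1)]

values = range_aggregation_years(2001, 2010)
# → ["2001-01-01 00:00:00", ..., "2010-01-01 00:00:00"], 10 entries
```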

Sorting

Sorting works the same as described for the date, date+time type.

Export

The XML Export for the above example looks like this:

<daterange type="daterange" column-api-id="4">
  <from>2001</from>
  <to>2010</to>
  <text>
  <fi-FI>Uuden vuosituhannen ensimmäinen vuosikymmen</fi-FI>
  <en-US>The first decade in the new millennium</en-US>
  </text>
</daterange>

The JSON exports the data like this:

{
  "daterange": {
    "from": "2001",
    "to": "2010",
    "text": {
      "fi-FI": "Uuden vuosituhannen ensimmäinen vuosikymmen",
      "en-US": "The first decade in the new millennium"
    }
  }
}

CSV exports like this:

daterange.from,daterange.to,daterange.text.fi-FI,daterange.text.en-US
2001,2010,Uuden vuosituhannen ensimmäinen vuosikymmen,The first decade in the new millennium

Other types

Other supported data types are boolean, file and geojson.

boolean

The type boolean can be used to store two states, false and true. The default is false.

API

Boolean looks like this when sent and received over the API:

{
    "bool": true
}

The above example has the value true for column bool.

Index

The indexed document contains an entry with false or true. It is mapped as type boolean.

Sorting

The ascending sort order is false, true. The sorting is independent of the requested search language.

Export

The XML representation looks like this:

<bool type="boolean" column-api-id="2">true</bool>

This is for a column bool and the value true. A value of false is also always rendered.

The CSV representation is false or true, respectively.

The JSON representation is a JSON boolean.

The storage inside the server distinguishes between null and false, but this is not visible over the API.

geojson

The geojson type stores GeoJSON according to RFC 7946. For full GeoJSON support you need to use Opensearch as indexer or a licensed Elasticsearch.

Also, for the server you will need a license with GeoJSON support. The license is not needed to use GeoJSON over /api/db, but it is for /api/search.

API

GeoJSON can contain multiple types of geo coordinates. The simplest type is Point, which references a single point using longitude, latitude and altitude (optional):

{
  "geo": {
    "type": "Point",
    "coordinates": [6.8652, 45.8326, 10]
  }
}

This example sets a Point in column geo: the longitude is set to 6.8652, latitude to 45.8326 and altitude to 10. Other types storable in GeoJSON are: "Point", "MultiPoint", "LineString", "MultiLineString", "Polygon", "MultiPolygon", and "GeometryCollection". fylr also supports "Feature" and "FeatureCollection".
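A minimal structural check for the simplest case can look like this (a sketch, not the server's actual validation):

```python
# Geometry and feature types accepted per RFC 7946 (plus Feature types).
GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
    "Feature", "FeatureCollection",
}

def is_valid_point(obj: dict) -> bool:
    # A Point needs [longitude, latitude] and may carry an altitude.
    coords = obj.get("coordinates")
    return (obj.get("type") == "Point"
            and isinstance(coords, list)
            and len(coords) in (2, 3)
            and all(isinstance(c, (int, float)) for c in coords))

is_valid_point({"type": "Point", "coordinates": [6.8652, 45.8326, 10]})  # True
```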

Index

The type geo_shape is used to store the GeoJSON value in the indexer.

In some rare cases, the API will accept the GeoJSON but the indexer won't. You will find these cases by looking for INDEX_ERROR events in /inspect/events or the Event Manager. If you encounter such a problem, you need to amend your data.

The values of all properties of types ”Feature” and ”FeatureCollection” are collected and indexed as text.

Sorting

Sorting is not supported for the geojson type.

Export

The XML exports the actual GeoJSON value in JSON format. For the above example:

<geo type="geo_json" column-api-id="2">
   {&#34;coordinates&#34;:[6.8652,45.8326,10],&#34;type&#34;:&#34;Point&#34;}
</geo>

The CSV contains the JSON as text:

{"coordinates":[6.8652,45.8326,10],"type":"Point"}

Same with JSON, the export looks like this:

{
  "geo": {
    "type": "Point",
    "coordinates": [
      6.8652,
      45.8326,
      10
    ]
  }
}
