This article describes the various user data types in Collection Pro and their behavior during import and export situations.
Text types
Text types are used to store text. Collection Pro supports text, text_oneline, string, text_l10n and text_l10n_oneline.
text, text_oneline
text and text_oneline store characters in one language. In the server, these types are treated the same; text_oneline is only a hint for frontends on how to display the data. So it is possible (over the API) to use newlines in a text_oneline column.
Text is stored as received, so a leading or trailing space is preserved. Frontends are requested to trim texts before sending them to the API.
Text must be encoded in UTF-8 and can be stored in any Unicode normalization form. There is no limit to the length of the text that can be stored.
API
Text looks like this when sent and received over the API:
{
  "text": "this is the text"
}
The above example has the value this is the text for column text.
Index
Normalization is performed as part of index document creation, where all text is run through an icu_normalizer.
In the indexer, text is stored using a custom analyzer icu_text, which works as follows:
{
  "analysis": {
    "icu_text": {
      "type": "custom",
      "tokenizer": "custom_icu_tokenizer",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    },
    "tokenizer": {
      "custom_icu_tokenizer": {
        "type": "pattern",
        "pattern": "([[^\\p{L}\\p{Digit}\\uD800-\\uDFFF\\u2F00-\\u2FDF]&&[^&%§\\$€]])"
      }
    }
  }
}
Text is normalized using the icu_normalizer and tokenized using the above pattern.
What gets included in tokens:
- All alphabetic letters from any language.
- All numeric digits.
- Characters in the Unicode surrogate pair range and Kangxi Radicals.
- The symbols &, %, §, $, and €.
What causes token separation:
- Punctuation marks (except the specified symbols).
- Whitespace characters.
- Other symbols and control characters not specified.
Tokens are then turned into terms using the icu_folding token filter. The filter folds away diacritics and turns all characters into lower case, so the token Bär is stored as bar.
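To make the pipeline concrete, here is a minimal Go sketch approximating the tokenizer and the folding step. It is an illustration, not the actual analyzer: Go's regexp syntax has no character-class intersection, so the pattern is rewritten, the surrogate range is omitted (Go strings are UTF-8), and strings.ToLower only covers the lowercasing part of icu_folding. The input string is purely illustrative.

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// Rough Go equivalent of custom_icu_tokenizer: anything that is not a
// letter, a digit, a Kangxi Radical or one of &%§$€ separates tokens.
var separator = regexp.MustCompile(`[^\p{L}\p{Nd}\x{2F00}-\x{2FDF}&%§$€]+`)

func main() {
    tokens := separator.Split("Bär, Preis: 100€!", -1)
    var terms []string
    for _, tok := range tokens {
        if tok == "" {
            continue // skip empty pieces around separators
        }
        // icu_folding would also fold diacritics (Bär -> bar);
        // ToLower performs only the lowercasing part.
        terms = append(terms, strings.ToLower(tok))
    }
    fmt.Println(terms) // [bär preis 100€]
}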
Using the API, searches for text can be performed either on the analyzed value (matching the terms), or on the unanalyzed value, which is stored alongside the terms. The unanalyzed value stores the data as is; no normalization takes place.
All text for indexed documents is split into chunks of 8,000 UTF-8 characters. When matching full texts in analyzed form, texts cannot easily be matched if they exceed 8,000 characters.
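A sketch of the chunking idea (an illustration of the rule, not the server's code):

package main

import (
    "fmt"
    "strings"
)

// chunk splits s into pieces of at most n characters (runes).
func chunk(s string, n int) []string {
    runes := []rune(s)
    var out []string
    for len(runes) > n {
        out = append(out, string(runes[:n]))
        runes = runes[n:]
    }
    return append(out, string(runes))
}

func main() {
    long := strings.Repeat("ä", 20000)
    fmt.Println(len(chunk(long, 8000))) // 3: 8000 + 8000 + 4000 characters
}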
Sorting
Sort strings are compiled using the Go library collate. It assumes the text is in the first configured database language. Numbers are recognized so that Car 100 sorts after Car 11. Text is normalized by the collate library. Internally, we use the hex representation of that string to work around anomalies in Elasticsearch. Some special replacement is always done [TODO: link].
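As an illustration, this Go sketch builds such a numeric-aware sort key with the collate package. The language tag stands in for the first configured database language, and the hex rendering mirrors the description above; it sketches the idea, not the server's actual code.

package main

import (
    "fmt"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

func main() {
    // collate.Numeric makes "Car 11" sort before "Car 100".
    c := collate.New(language.English, collate.Numeric)
    fmt.Println(c.CompareString("Car 11", "Car 100")) // -1

    // The binary sort key can then be stored in hex form.
    var buf collate.Buffer
    key := c.KeyFromString(&buf, "Car 100")
    fmt.Printf("%x\n", key)
}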
Export
Text is exported as is, keeping spaces & normalization.
The XML export looks like this for a column named title and a value Title for the type text_oneline. The column-api-id in this example is 29.
<title type="text_oneline" column-api-id="29">Title</title>
For the type text, the type attribute is text.
The CSV and JSON output the text as is.
string
The string type’s main difference from the text type is how it is indexed. It is recommended to use string types for identification strings which may contain special characters that would be dropped by the analyzer.
API
String looks like this when sent and received over the API:
{
  "ref": "A$5667"
}
The above example has the value A$5667 for column ref.
Index
String values are normalized and lowercased for the index document.
{
  "analyzer": {
    "keyword_lowercase": {
      "tokenizer": "keyword",
      "filter": [
        "icu_folding"
      ],
      "char_filter": [
        "icu_normalizer"
      ]
    }
  }
}
Strings are normalized using the icu_normalizer and converted to lower case using the icu_folding token filter.
All strings for indexed documents are split into chunks of 8,000 UTF-8 characters. When matching full strings in analyzed form, strings cannot easily be matched if they exceed 8,000 characters.
Sorting
The sorting of string values works like for text. In addition to the text sorting, a pure alphanumerical version is stored in the index alongside the numerically sortable variant. With that, sorting can order Car 10, Car 11, Car 12, Car 100. Some special replacement is always done.
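To show the difference between the two stored variants, a short sketch comparing a plain collator with a numeric one (the language tag is illustrative):

package main

import (
    "fmt"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

func main() {
    plain := collate.New(language.English)
    numeric := collate.New(language.English, collate.Numeric)
    // Alphanumerically, "Car 100" precedes "Car 11" (digit by digit);
    // numerically, it follows it.
    fmt.Println(plain.CompareString("Car 100", "Car 11"))   // -1
    fmt.Println(numeric.CompareString("Car 100", "Car 11")) // 1
}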
Export
The XML looks like it does for text.
<ref type="string" column-api-id="346">hall/7$</ref>
In this example, the column ref is exported with the value hall/7$.
The CSV and JSON export the string as is.
text_l10n, text_l10n_oneline
The types text_l10n and text_l10n_oneline are designed to store localized values. The format is a JSON object consisting of the language as key and the text as value.
API
Localized text looks like this when sent and received over the API:
{
  "title_loca": {
    "fi-FI": "Finnish",
    "en-US": "English"
  }
}
The above example sets the Finnish value Finnish and the English value English for column title_loca.
The API does not check the language entered, so loading and saving languages that are not configured in the database is supported.
Index
Indexing is done the same way the text format is indexed. Only enabled database languages are mapped into the index; other languages are ignored. After changing the database language settings, a reindex is required.
Sorting
Sorting is performed using a collate string produced by the Go library collate. The language is parsed as a BCP 47 string and passed to the library. Some special replacement is done:
'ʾ': '@', // 02BE Hamza (open to the front); @ sorts this before A
'ʿ': '@', // 02BF Ayn (open to the back); @ sorts this before A
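A minimal sketch of how such a replacement could be applied before the collate key is built (assumption: the replacement runs on the input string; the language tag is illustrative):

package main

import (
    "fmt"
    "strings"

    "golang.org/x/text/collate"
    "golang.org/x/text/language"
)

func main() {
    // Map Hamza (U+02BE) and Ayn (U+02BF) to '@' so they sort before 'A'.
    replace := strings.NewReplacer("\u02BE", "@", "\u02BF", "@")

    tag, err := language.Parse("fi-FI") // the language is parsed as BCP 47
    if err != nil {
        tag = language.Und // fall back to the undetermined language
    }

    var buf collate.Buffer
    key := collate.New(tag).KeyFromString(&buf, replace.Replace("\u02BEara"))
    fmt.Printf("%x\n", key)
}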
Export
XML exported data looks like this:
<title_loca type="text_l10n" column-api-id="81">
  <fi-FI>Finnish</fi-FI>
  <en-US>English</en-US>
</title_loca>
The example shows the XML snippet for a column title_loca with the type text_l10n.
In CSV, the values are exported like this:
| title_loca.fi-FI | title_loca.en-US |
| --- | --- |
| Finnish | English |
For each language exported, a suffix .<code> is added to the column name.
Number types
Number types are used to store numbers. Collection Pro supports number, double and integer.2.
number
The type number stores integers between -(2**53)+1 and (2**53)-1. This follows the recommendation in RFC 8259. The range is between -9,007,199,254,740,991 and 9,007,199,254,740,991 (inclusive).
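The boundary can be checked in Go: this sketch round-trips the maximum value through JSON, where parsers commonly use IEEE 754 doubles.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    const max = int64(1)<<53 - 1 // 9007199254740991, the RFC 8259 safe limit
    b, _ := json.Marshal(max)
    fmt.Println(string(b)) // 9007199254740991

    var f float64
    json.Unmarshal(b, &f)
    fmt.Println(int64(f) == max) // true: still exact at the boundary

    // One step above the limit, doubles can no longer tell values apart:
    fmt.Println(float64(max+1) == float64(max+2)) // true
}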
API
Number looks like this when sent and received over the API:
{
  "number": 1234
}
The above example has the value 1234 for column number.
Index
The index stores the number as long.
Sorting
The sorting is independent of the requested search language.
Export
The XML Export looks like this:
<number type="number" column-api-id="10">1234</number>
This shows the example export for column number and a value of 1234.
CSV and JSON export the number as is.
integer.2
The integer.2 type is like a currency type. The value is stored with a 2-digit fraction without loss. Internally, the server stores this as an integer. Representations of the value in export and standard info display the fraction. For example, the value 1234 will display and be rendered as 12.34. The limits for this type are the same as for number.
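A hypothetical helper sketching the conversion from the stored integer to its display form (an illustration of the rule, not the server's code):

package main

import "fmt"

// renderInteger2 renders a stored integer.2 value: the last two digits
// of the integer are the fraction, so 1234 represents 12.34.
func renderInteger2(v int64) string {
    sign := ""
    if v < 0 {
        sign, v = "-", -v
    }
    return fmt.Sprintf("%s%d.%02d", sign, v/100, v%100)
}

func main() {
    fmt.Println(renderInteger2(1234)) // 12.34
    fmt.Println(renderInteger2(567))  // 5.67
}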
API
Number looks like this when sent and received over the API:
{
  "integer_2": 567
}
The above example has the value 5.67 for column integer.2.
Index
The index stores the integer.2 as long.
Sorting
The sorting is independent of the requested search language.
Export
The XML Export looks like this:
<integer_2 type="integer.2" column-api-id="4">5.67</integer_2>
This shows the example export for column integer_2 and a value of 5.67.
CSV export uses 5.67, whereas JSON exports the value as 567.
double
The type double supports storing and loading a 64-bit floating point number (Go type float64). The range for that number is approximately -1.7976931348623157e+308 to 1.7976931348623157e+308. Go adheres to IEEE 754 for storing the values.
API
Number looks like this when sent and received over the API:
{
  "double": 1234.5678
}
Numbers are guaranteed to keep their value, but the JSON parser may change 1e4 into 10000 upon load.
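This behavior can be reproduced with Go's JSON package:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var v float64
    json.Unmarshal([]byte("1e4"), &v) // scientific notation is accepted
    b, _ := json.Marshal(v)
    fmt.Println(string(b)) // 10000: the spelling "1e4" is not preserved
}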
Index
The type double is stored as type double in the indexer.
Sorting
The sorting is independent of the requested search language.
Export
The XML Export looks like this:
<double type="double" column-api-id="2">1234.5678</double>
This shows the example export for column double and a value of 1234.5678.
CSV and JSON export the number as is.
Date types
The date types store dates with or without time, and also date ranges. The server supports dates where only the year is given, or only the year and month. The types are date, date+time and daterange.
date, date+time
The types date and date+time are used to load and store dates and date+times as defined in ISO 8601. Note that the server doesn’t support a type to store a duration, a time on its own, or a date without a year.
The server does support storing just the calendar year, or year and month. These dates are stored as a date range internally, so e.g. 2000 would be stored as a range from 2000-01-01 00:00:00 to 2000-12-31 23:59:59.
The supported range for dates is between the years -292,277,022,399 and 292,277,022,399. This is roughly UNIX seconds stored in a 64-bit integer.
API
The API treats the types date and date+time the same. So it is possible to store a date+time in a date value. The type is merely a hint for frontends to render the value accordingly.
The longest possible form of the value contains the timezone as shown in this example:
{
  "date": {
    "value": "2010-12-10T12:45:00+01:00:00"
  }
}
This stores the date 2010-12-10 with time 12:45:00 in time zone +01:00:00. Time zones must be given using the offset. The server keeps the time zone. If a value without a time zone is given, the server assumes UTC as the time zone.
The shortest possible date value looks like this:
{
  "date": {
    "value": "2010"
  }
}
This has the date 2010 spanning a range from 2010-01-01 00:00:00 to 2010-12-31 23:59:59.
The API can parse a wide variety of formats. This includes parsing fractional seconds. Fractional seconds are discarded.
The output is always rendered in one of the following formats:
YEAR = "2006"
YEAR_MONTH = "2006-01"
DATE = "2006-01-02"
DATE_TIME = "2006-01-02T15:04"
DATE_TIME_SEC = "2006-01-02T15:04:05"
DATE_TIME_SEC_TZ = "2006-01-02T15:04:05Z07:00"
DATE_TIME_SEC_TZS = "2006-01-02T15:04:05Z07:00:00"
The left side reflects the recognized width of the value.
Note: The parser does not support sending a timezone for shortened date strings. For shorter date strings, the stored time zone is always UTC.
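The layouts above are Go reference-time layouts, so the rendering and the UTC assumption can be sketched with the standard time package (the values are illustrative):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Parse a full value including a time zone offset.
    t, _ := time.Parse("2006-01-02T15:04:05Z07:00", "2010-12-10T12:45:00+01:00")
    fmt.Println(t.Format("2006-01-02T15:04")) // 2010-12-10T12:45

    // A shortened date string carries no zone information; it parses as UTC.
    u, _ := time.Parse("2006-01-02", "2010-12-10")
    fmt.Println(u.Location()) // UTC
}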
Index
The server calculates all dates to index based on the UNIX seconds computed for the UTC time zone of the given date. The values are stored as long in the indexer. Various values are calculated to reflect the range every date creates. Alongside the from and to values, a from_to value is stored which allows for aggregations using the middle value of a date.
Sorting
Sorting can be done using the stored subfields from, to or from_to (default). Using the API, sort values can be adjusted to match a specific date width (day, week, month, year). With that, values spanning two years in UTC (think New Year’s Eve in Sydney and Los Angeles) can be adjusted to be sorted by the same value using width year.
Sort values can also be adjusted to a specific timezone using the API.
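The New Year’s Eve example can be made concrete with Go’s time package: the same local moment falls into two different UTC years, which a sort width of year in a chosen time zone smooths over.

package main

import (
    "fmt"
    "time"
)

func main() {
    sydney, _ := time.LoadLocation("Australia/Sydney")
    la, _ := time.LoadLocation("America/Los_Angeles")

    // Midnight of the same New Year in both cities:
    nySydney := time.Date(2025, 1, 1, 0, 0, 0, 0, sydney)
    nyLA := time.Date(2025, 1, 1, 0, 0, 0, 0, la)

    // In UTC, the two values land in different years.
    fmt.Println(nySydney.UTC().Year(), nyLA.UTC().Year()) // 2024 2025
}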
Export
The XML contains the date rendered according to the stored width of the date.
<date type="date" column-api-id="2">2010-12-10T12:45:00+01:00:00</date>
The example shows the output for a column date with the date value 2010-12-10T12:45:00+01:00:00.
The same value is used for the CSV and JSON export.
daterange
The daterange type stores a lower and an upper value of a date range. Either one can be omitted to store an open range.
API
The API uses an object with from and to for the daterange type:
{
  "daterange": {
    "from": "2001",
    "to": "2010",
    "text": {
      "fi-FI": "Uuden vuosituhannen ensimmäinen vuosikymmen",
      "en-US": "The first decade in the new millennium"
    }
  }
}
Parsing of the dates works the same as for the date, date+time type. Thus, a daterange can store the full set of widths available for dates in fylr. The shortest date is the year and the longest is the full time including a timezone with seconds.
Alongside the date values, a textual value for the daterange can be set. The format of the text property matches the format of the text_l10n type.
Index
The index works the same as for the date, date+time type. Internally, the server always works with date ranges to store values.
Alongside the from and to values, a from_to value is stored which allows for aggregations using the middle value of a date. To aggregate over the range, the server additionally stores a range value which consists of all values between the lower and upper limit. The server stores all years, months and days found between the lower and upper value of the range, respectively. If the date has the width year, all years in between are stored for aggregations, e.g. for the above example the dates 2001-01-01 00:00:00, 2002-01-01 00:00:00, ..., 2010-01-01 00:00:00.
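A sketch of what that enumeration amounts to for the example range (assuming inclusive bounds):

package main

import "fmt"

func main() {
    // For a daterange 2001..2010 with width "year", one value per year
    // is stored for aggregations.
    for year := 2001; year <= 2010; year++ {
        fmt.Printf("%04d-01-01 00:00:00\n", year)
    }
}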
Sorting
Sorting works the same as described for the date, date+time type.
Export
The XML Export for the above example looks like this:
<daterange type="daterange" column-api-id="4">
  <from>2001</from>
  <to>2010</to>
  <text>
    <fi-FI>Uuden vuosituhannen ensimmäinen vuosikymmen</fi-FI>
    <en-US>The first decade in the new millennium</en-US>
  </text>
</daterange>
The JSON exports the data like this:
{
  "daterange": {
    "from": "2001",
    "to": "2010",
    "text": {
      "fi-FI": "Uuden vuosituhannen ensimmäinen vuosikymmen",
      "en-US": "The first decade in the new millennium"
    }
  }
}
The CSV export looks like this:
| daterange.from | daterange.to | daterange.text.fi-FI | daterange.text.en-US |
| --- | --- | --- | --- |
| 2001 | 2010 | Uuden vuosituhannen ensimmäinen vuosikymmen | The first decade in the new millennium |
Other types
Other supported data types are boolean, file and geojson.
boolean
The type boolean can be used to store two states, false and true. The default is false.
API
Boolean looks like this when sent and received over the API:
{
  "bool": true
}
The above example has the value true for column bool.
Index
The indexed document contains an entry with true or false. It is mapped as type boolean.
Sorting
The ascending order of the sort is false, true. The sorting is independent of the requested search language.
Export
The XML representation looks like this:
<bool type="boolean" column-api-id="2">true</bool>
This is for a column bool and the value true. false is also always rendered.
The CSV representation is true or false, respectively.
The JSON representation is a JSON boolean.
The storage inside the server distinguishes between null and false, but this is not visible over the API.
geojson
The geojson type stores GeoJSON according to RFC 7946. For full GeoJSON support you need to use OpenSearch as indexer or a licensed Elasticsearch.
Also, the server needs a license with GeoJSON support. The license is not needed to use GeoJSON over /api/db, but it is needed for /api/search.
API
GeoJSON can contain multiple types of geo coordinates. The simplest type is Point, which references a single point using longitude, latitude and altitude (optional):
{
  "geo": {
    "type": "Point",
    "coordinates": [6.8652, 45.8326, 10]
  }
}
This example sets a Point in column geo: the longitude is set to 6.8652, the latitude to 45.8326 and the altitude to 10. Other types storable in GeoJSON are: "Point", "MultiPoint", "LineString", "MultiLineString", "Polygon", "MultiPolygon", and "GeometryCollection". fylr also supports "Feature" and "FeatureCollection".
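A minimal Go sketch of decoding such a Point; the struct covers only the fields used here and is not fylr's internal representation:

package main

import (
    "encoding/json"
    "fmt"
)

// geometry is an illustrative subset of a GeoJSON geometry object.
type geometry struct {
    Type        string    `json:"type"`
    Coordinates []float64 `json:"coordinates"` // [longitude, latitude, altitude]
}

func main() {
    raw := []byte(`{"type":"Point","coordinates":[6.8652,45.8326,10]}`)
    var g geometry
    if err := json.Unmarshal(raw, &g); err != nil {
        panic(err)
    }
    fmt.Println(g.Type, g.Coordinates) // Point [6.8652 45.8326 10]
}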
Index
The type geo_shape is used to store the GeoJSON value in the indexer.
In some rare cases, the API will accept the GeoJSON but the indexer won’t. You will find these cases by looking for INDEX_ERROR events in /inspect/events or the Event Manager. If you encounter such a problem, you need to amend your data.
The values of all properties of the types "Feature" and "FeatureCollection" are collected and indexed as text.
Sorting
Sorting is not supported for the geojson type.
Export
The XML exports the actual GeoJSON value in JSON format. For the above example:
<geo type="geo_json" column-api-id="2">
{"coordinates":[6.8652,45.8326,10],"type":"Point"}
</geo>
The CSV contains the JSON as text:
{"coordinates":[6.8652,45.8326,10],"type":"Point"}
The same applies to JSON; the export looks like this:
{
  "geo": {
    "type": "Point",
    "coordinates": [
      6.8652,
      45.8326,
      10
    ]
  }
}