Types of data

There are two main types of data, non tabular data, such as documents, images and blocks of JSON or XML, and tabular data. Tabular data can be though of as the sort of data you can put into an Excel spreadsheet with rows and columns. xMart is designed to store and work with tabular data.

Types of tabular data

There are several ways to think of data in the system. The data has different uses and is thought of differently.

Fact and Reference data

Here is a typical Excel spreadsheet with data in it.

Image showing an Excel spreadsheet with the column names ISO3, Date, Sex, Classification and Count. There is a dotted red line between Classification and Count. The first 4 columns have a label of Reference Data and the Count column has a label of Fact Data

You would expect a report on this data to be sliced by some combination of ISO3, Date, Sex and Classification but not be Count which you would expect to be compared, summed, averaged etc.

The Count is considered Fact data.

The columns ISO3, Date, Sex and Classification are considered Reference data.

Fact data

A fact is a set of values which is different for a set of reference columns and changes over time. It is the actual result of the parameter which is measured.

Reference data

Reference data is data like the country table which isn’t a measure and normally doesn’t change.

There are various benefits to splitting out reference data into separate tables

  • The fact data could come from an outside source and only contain the ISO3 Code for example.

  • By using reference data, you now know data such as the WHO Region although the source has no idea of what it is
  • You can also accept country names in different languages and know which country it refers to
  • It also makes drop down lists easier to create and translate
  • Standardizes things like names
  • Any changes to the reference data can appear to the end user if the reference data is combined with the fact data. This is the case for both new and historic data

Aggregated data

In Excel, you can do various functions on the columns like Sum, Average, Count, Maximum and Minimum. The columns with these types of data are considered Aggregated data.

Article List / Line list

This is data which can be cleaned but contains a lot of data for example case data.

Raw (untransformed)

Sometimes, it is better to load the raw data in without any checking and data cleansing because there may be limits on the source access time or the raw data needs to be cleaned by hand before being loaded into Fact tables.

What are the advantages of using reference data?

  • The fact data could come from an outside source and only contain the ISO3 Code for example
  • By using reference data, you now know data such as the WHO Region although the source has no idea of what it is
  • You can also accept country names in different languages and know which country it refers to
  • It also makes drop down lists easier to create and translate
  • It can be used to standardize things like names
  • Any changes to the reference data can appear to the end user if the reference data is combined with the fact data. This is the case for both new and historic data

Getting started

If you imagine you were going to build a warehouse for storing products to ship to customers, there are various questions you would ask yourself:

  • What products am I going to store?
  • How am I going to arrange the products? Do I need new shelves or do the products belong on other existing shelves?
  • How am I going to organise my products on the shelves ?
  • How am I going to split my shelves into sections to hold the products?
  • How am I going to be able to find the products?
  • Do I have products that are connected in some way ?
  • Do I need a new warehouse or can I use one that already exists?

You can think about marts in a similar way. You need to ask the same sort of questions

  • What data am I going to store?
  • How am I going to arrange the data? Do I need new tables or does the data belong on other existing tables?
  • How am I going to organise my data in the tables?
  • How am I going to be able to find the data?
  • Do I have data that is connected in some way ?
  • Do I need a new mart or can I use one that already exists?

If you need a new set of tables then you will need to create a good model for it.

What is a data model?

A data model consists of several parts

  • Location (Mart)
  • Tables
  • Fields in the Tables
  • Fact and reference data
  • How the Tables relate to each other

Tables

What is a table?

Like an Excel tab, a table is a set of data stored in rows and columns.

In an Excel tab, the columns run across the top and are referred to be letters and the rows run down the side and are referred to by numbers.

Image showing an Excel sheet with 5 columns and one row. The columns are arranged across the top and each one has a letter. They have the headings ISO3, Date, Sex, Classification and Count. There are numbers underneath each of the columns. To the right is a sequential row number. The first line of data is shown to be a row

A table is the same where the columns have headings and are across the top and the data has rows and goes down the side.

Image showing a table with 5 columns and one row. The columns are arranged across the top and have the headings ISO3, Date, Sex, Classification and Count. There are numbers underneath each of the columns. The first line of data is shown to be a row

Like Excel tabs and cells which can reference each other, tables and columns can reference each other.

Unlike Excel columns, each table column must have a data type reflecting the data which is going to be stored.

What are data types?

In Excel, the user can force a column or set of cells to have a type by formatting the cell Image showing an Excel spreadsheet with the format Cells boy on the data type selection tab

There are various data types available which vary from system to system. They are mostly the same but sometimes have different names.

In xMart, there are the following data types

Integer data types

Data Type Storage
BIG_INTEGER Whole number from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
INTEGER Whole number from -2,147,483,648 to 2,147,483,647

Decimal data types

Data Type Storage
DECIMAL_16_9 A decimal datatype limited to 16 digits total: 7 digits to the left of the decimal point and 9 to the right.
DECIMAL_28_14 A decimal datatype limited to 28 digits total: 14 digits to the left of the decimal point and 14 to the right.
DECIMAL_28_9 A decimal datatype limited to 28 digits total: 19 digits to the left of the decimal point and 9 to the right.
PERCENT Decimal percent data type, from -1000 to 1000. Input string format expected to end with ‘’%’’ character

String data types

The string types in xMart are unicode (extended) character strings (i.e. support Chinese, Russian etc.) of the length specified.

Data Type Storage
TEXT_2 2 unicode characters
TEXT_3 3 unicode characters
TEXT_10 10 unicode characters
TEXT_50 50 unicode characters
TEXT_450 450 unicode characters
TEXT_4000 4000 unicode characters
TEXT_MAX Up to 2GB of Text but can be slow when filtered

Data and Time data types

Data Type Storage
DATE Date values including year, month and day. No time information.
DATE_TIME Date-time values including year, month, day and also time values to sub-second precision.
MONTH 1 or 2 digit month values such as 3 (March) or 10 (October)
YEAR Full-digit year values, such as ‘2003’

Other data types

Data Type Storage
BOOLEAN Stored as 1 or 0 but data can be provided using various common synonyms such as True/False, T/F, Yes/No, Y/N.
GUID 36-character value that is globally unique across tables, databases, and servers.
MEDIA Media type for binary data like pictures/files. For better display, a field with code MIMETYPE will be used if available. We don’t recommend storing these types of files because the retrieval can be slow
XML Valid XML
FOREIGN_KEY Foreign key to another table

Foreign Keys

A Foreign Key is udes to join tables together. It makes sure that any value put into a field exists in another table. This is useful for checking if the data being loaded is valid. So for example, if someone loads an invalid country code, the load procedure won’t accept it and will flag it up as an error.

Here is an example of a Foreign Key being used to connect Fact data with Reference data.

Star Schema

Image showing a Fact table record in the middle with the headers ISO3, Date, Sex, Classification and Count. Above the Fact table, to the left, there is a list of Genders with a code and the translation into different languages. The label for this table is Gender Reference Table. There is an arrow going between the Sex column of the Fact table and the Code column of the Gender Reference table. This arrow has a label Foreign Key. To the right of the Gender Reference table,  there is a table with a Code column, a Title Column and a Description column. The label for this table is Classification Reference Data. There is an arrow going between the Classification column of the Fact table and the Code column of the Classification Reference table. This arrow has a label Foreign Key. Below the fact table it there is a separate table showing a list of data for a country with the label Country Reference Data. There is an arrow between the ISO3 column of the Fact table and the ISO3 column of the Country Reference table with the label Foreign Key

This arrangement is known as a Star schema because it can be represented in a diagram the Fact table sits in the middle and the Reference tables sitting all around it connected by lines so it would look like an star.

Image showing a five point star. IN the middle is a box with the label Fact Table. At each point of the star there is a box with the label Reference Table. There are arrows pointing from the Fact Table box in the middle to the Reference Table boxes at each point of the star

Snowflake Schema

You can have Reference tables which have Foreign Keys as well. So for example, it I have a Reference table of countries and a Reference table of Country Capital Cities, I can join my Fact table to the capital cities be going through the Country Reference table.

Image showing a Fact table record in the middle with the headers ISO3, Date, Sex, Classification and Count. Above the Fact table, to the left, there is a list of Genders with a code and the translation into different languages. The label for this table is Gender Reference Table. There is an arrow going between the Sex column of the Fact table and the Code column of the Gender Reference table. This arrow has a label Foreign Key. To the right of the Gender Reference table,  there is a table with a Code column, a Title Column and a Description column. The label for this table is Classification Reference Data. There is an arrow going between the Classification column of the Fact table and the Code column of the Classification Reference table. This arrow has a label Foreign Key. Below the fact table there is a separate table showing a list of data for a country with the label Country Reference Data. There is an arrow between the ISO3 column of the Fact table and the ISO3 column of the Country Reference table with the label Foreign Key. Underneath the Country Reference table to the left, there is a table with the label Country Captial Reference Table. There is an arrow between the ISO3 code of the Country Reference table and the Country Capital Reference table. This arrow has the label Froeign Key

This is known as a Snowflake Schema because the Fact table sits in the middle and the lines come off it and then other lines come off them so the lines resemble a snowflake

Image showing a box in the middle with the label Fact table. It is connected by a vertical arrow which points to a box directly above it. The box has the label Reference Table. There are diagonal arrows coming from each side and an arrow pointing straight upwards. These arrows point to boxes with the label Reference Data. Each of those boxes has diagonal arrows coming from each side and a straight arrow towards the edge of the page. The arrows point to boxes with the label Reference Data. The same pattern is repeated underneath

Data Cleansing

Apart from joining tables, Foreign Keys also checks the data entered for valid values so are a form of data cleansing.

There are various reasons why data may be rejected as invalid

Possible reason Resolution
There is a value which is completely invalid like an ISO3 Code of “ZZZ” This is invalid so they will need to correct it
The value is in another language or is an alternative spelling (for example “Suisse” instead of “Switzerland”) It’s valid but in a different language. Use a different column for the lookup or use a Synonym table
There is a spelling mistake on the data source (for example “Siwtzerland” instead of “Switzerland”) If the data comes from a user via Excel or CSV then it is best to correct it at source. However, it is not always possible to do this if the data comes from an extarnal source like another agengy. In which case, it is best to use a synonym table

Synonym tables

A Synonam table is a table in the data load pipeline to interpret what comes in to what you were hoping for.

Image showing a Fact table record at the top with the headers Country Name, Date, Sex, Classification and Count. Below the Fact table, to the right, there is a table with two columns. The column on the left is marked Input Country Name and contains the country names for Switzerland transled into different languages. The column on the right is marked Output ISO3 and contains the ISO3 code for Switzerland. The label for this table is Synonym Table. There is an dashed arrow going between the Country Name column of the Fact table which has the value Suisse and the Input Country Name column of the Synonym table where the Input Country Name is Suisse. Below the Synonym table there is a separate table showing a list of data for a country with the label Country Reference Data. There is a solid arrow between the Country Name column of the Fact table and the ISO3 column of the Country Reference table with the label Foreign Key. There is a dashed line going between the Output ISO3 of the Synonym table to the ISO3 column of the country Reference table

Many-to-Many Relationships

The Problem

Imagine there is a situation where there are 3 members of a team; Chris Diakos, Chris Faulkner and Chris Tantillo.

There are 3 tasks which require them to work in pairs; Task 1, Task 2 and Task 3.

They have each been assigned 2 tasks and they always work with someone different.

Chris Diakos is working on Task 1 and Task 2 Chris Faulkner is working on Task 2 and Task 3 Chris Tantillo is working on Task 3 and Task 1

The relationship looks like this

Image showing two tables. On the left, there’s a table labelled Person Table with two columns, labeled "Code" and "Name." This table lists three individuals:
CD for Chris Diakos
CF for Chris Faulkner
CT for Chris Tantillo.
On the right, there is another table labelled Task Table with two columns, labeled "Code" and "Title." This table lists three tasks:
T1 for Task 1
T2 for Task 2
T3 for Task 3.
Blue arrows connect entries between the two tables, representing assignments or relationships. Specifically:
Chris Diakos is linked to Task 1 and Task 3
Chris Faulkner is linked to Task 2 and Task 3
Chris Tantillo is linked to Task 1 and Task 3

This is called a many-to-many relationship.

How do you represent this in a database?

As the name suggests, a Link table is used as an intermediary table between two other tables and is used to implement many-to-many relationships. It consists of 2 foreign keys pointing to the referenced records. Both of the fields are the primary key to prevent duplication.

Image showing a "many-to-many" relationship model using three tables: Person Table, Task Table, and Person to Task Link Table.
Person Table (on the left): This table includes two columns, labeled "Code" and "Name," and lists three individuals:
CD for Chris Diakos
CF for Chris Faulkner
CT for Chris Tantillo
Task Table (on the right): This table has two columns, labeled "Code" and "Title," listing three tasks:
T1 for Task 1
T2 for Task 2
T3 for Task 3
Person to Task Link Table (in the center): This is a linking or junction table, labeled "Person to Task Link Table," which establishes the connections between the people and tasks. It has two columns:
Person_FK (foreign key for Person Code)
Task_FK (foreign key for Task Code)
Each row in this table represents an association between a person and a task, allowing for a "many-to-many" relationship. The associations are as follows:
Chris Diakos (CD) is linked to Task 1 (T1) and Task 2 (T2).
Chris Faulkner (CF) is linked to Task 2 (T2) and Task 3 (T3).
Chris Tantillo (CT) is linked to Task 1 (T1) and Task 3 (T3).

Arrows from the Person and Task tables to the Link table illustrate how the codes connect through this intermediary table, allowing each person to be associated with multiple tasks and vice versa. This structure is a classic “many-to-many” relationship model, which is commonly used in database design.

Primary Keys

A Primary Key is a unique identifier for each record in a database table, ensuring that each row can be distinctly identified. It consists of one or more columns, and the values in these columns must be unique across the entire table. Primary Keys are essential for establishing relationships between tables, as they can be referenced by foreign keys in other tables to create links. They improve data integrity, allowing databases to quickly retrieve, update, or delete records.

Whereas in a standard database Primary Keys are mandatory, xMart allows them to be set up as optional.

They are displayed like this with a Key in the Is Key field to indicate that the field is in the BPK and a red star next to the field code to indicate that it is mandatory

Image showing a table with five columns. The list of codes for these columns is INDEX_FK, COUNTRY_FK, MODEL_FK and VALUE. The first 4 fields have key icons in the first column (marked Is Key) and on the second, third and fourth lines there are red stars next to the column names