Types of data
There are two main types of data, non tabular data, such as documents, images and blocks of JSON or XML, and tabular data. Tabular data can be though of as the sort of data you can put into an Excel spreadsheet with rows and columns. xMart is designed to store and work with tabular data.
Types of tabular data
There are several ways to think of data in the system. The data has different uses and is thought of differently.
Fact and Reference data
Here is a typical Excel spreadsheet with data in it.
You would expect a report on this data to be sliced by some combination of ISO3, Date, Sex and Classification but not be Count which you would expect to be compared, summed, averaged etc.
The Count is considered Fact data.
The columns ISO3, Date, Sex and Classification are considered Reference data.
Fact data
A fact is a set of values which is different for a set of reference columns and changes over time. It is the actual result of the parameter which is measured.
Reference data
Reference data is data like the country table which isn’t a measure and normally doesn’t change.
There are various benefits to splitting out reference data into separate tables
-
The fact data could come from an outside source and only contain the ISO3 Code for example.
- By using reference data, you now know data such as the WHO Region although the source has no idea of what it is
- You can also accept country names in different languages and know which country it refers to
- It also makes drop down lists easier to create and translate
- Standardizes things like names
- Any changes to the reference data can appear to the end user if the reference data is combined with the fact data. This is the case for both new and historic data
Aggregated data
In Excel, you can do various functions on the columns like Sum, Average, Count, Maximum and Minimum. The columns with these types of data are considered Aggregated data.
Article List / Line list
This is data which can be cleaned but contains a lot of data for example case data.
Raw (untransformed)
Sometimes, it is better to load the raw data in without any checking and data cleansing because there may be limits on the source access time or the raw data needs to be cleaned by hand before being loaded into Fact tables.
What are the advantages of using reference data?
- The fact data could come from an outside source and only contain the ISO3 Code for example
- By using reference data, you now know data such as the WHO Region although the source has no idea of what it is
- You can also accept country names in different languages and know which country it refers to
- It also makes drop down lists easier to create and translate
- It can be used to standardize things like names
- Any changes to the reference data can appear to the end user if the reference data is combined with the fact data. This is the case for both new and historic data
Getting started
If you imagine you were going to build a warehouse for storing products to ship to customers, there are various questions you would ask yourself:
- What products am I going to store?
- How am I going to arrange the products? Do I need new shelves or do the products belong on other existing shelves?
- How am I going to organise my products on the shelves ?
- How am I going to split my shelves into sections to hold the products?
- How am I going to be able to find the products?
- Do I have products that are connected in some way ?
- Do I need a new warehouse or can I use one that already exists?
You can think about marts in a similar way. You need to ask the same sort of questions
- What data am I going to store?
- How am I going to arrange the data? Do I need new tables or does the data belong on other existing tables?
- How am I going to organise my data in the tables?
- How am I going to be able to find the data?
- Do I have data that is connected in some way ?
- Do I need a new mart or can I use one that already exists?
If you need a new set of tables then you will need to create a good model for it.
What is a data model?
A data model consists of several parts
- Location (Mart)
- Tables
- Fields in the Tables
- Fact and reference data
- How the Tables relate to each other
Tables
What is a table?
Like an Excel tab, a table is a set of data stored in rows and columns.
In an Excel tab, the columns run across the top and are referred to be letters and the rows run down the side and are referred to by numbers.
A table is the same where the columns have headings and are across the top and the data has rows and goes down the side.
Like Excel tabs and cells which can reference each other, tables and columns can reference each other.
Unlike Excel columns, each table column must have a data type reflecting the data which is going to be stored.
What are data types?
In Excel, the user can force a column or set of cells to have a type by formatting the cell
There are various data types available which vary from system to system. They are mostly the same but sometimes have different names.
In xMart, there are the following data types
Integer data types
Data Type | Storage |
---|---|
BIG_INTEGER | Whole number from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
INTEGER | Whole number from -2,147,483,648 to 2,147,483,647 |
Decimal data types
Data Type | Storage |
---|---|
DECIMAL_16_9 | A decimal datatype limited to 16 digits total: 7 digits to the left of the decimal point and 9 to the right. |
DECIMAL_28_14 | A decimal datatype limited to 28 digits total: 14 digits to the left of the decimal point and 14 to the right. |
DECIMAL_28_9 | A decimal datatype limited to 28 digits total: 19 digits to the left of the decimal point and 9 to the right. |
PERCENT | Decimal percent data type, from -1000 to 1000. Input string format expected to end with ‘’%’’ character |
String data types
The string types in xMart are unicode (extended) character strings (i.e. support Chinese, Russian etc.) of the length specified.
Data Type | Storage |
---|---|
TEXT_2 | 2 unicode characters |
TEXT_3 | 3 unicode characters |
TEXT_10 | 10 unicode characters |
TEXT_50 | 50 unicode characters |
TEXT_450 | 450 unicode characters |
TEXT_4000 | 4000 unicode characters |
TEXT_MAX | Up to 2GB of Text but can be slow when filtered |
Data and Time data types
Data Type | Storage |
---|---|
DATE | Date values including year, month and day. No time information. |
DATE_TIME | Date-time values including year, month, day and also time values to sub-second precision. |
MONTH | 1 or 2 digit month values such as 3 (March) or 10 (October) |
YEAR | Full-digit year values, such as ‘2003’ |
Other data types
Data Type | Storage |
---|---|
BOOLEAN | Stored as 1 or 0 but data can be provided using various common synonyms such as True/False, T/F, Yes/No, Y/N. |
GUID | 36-character value that is globally unique across tables, databases, and servers. |
MEDIA | Media type for binary data like pictures/files. For better display, a field with code MIMETYPE will be used if available. We don’t recommend storing these types of files because the retrieval can be slow |
XML | Valid XML |
FOREIGN_KEY | Foreign key to another table |
Foreign Keys
A Foreign Key is udes to join tables together. It makes sure that any value put into a field exists in another table. This is useful for checking if the data being loaded is valid. So for example, if someone loads an invalid country code, the load procedure won’t accept it and will flag it up as an error.
Here is an example of a Foreign Key being used to connect Fact data with Reference data.
Star Schema
This arrangement is known as a Star schema because it can be represented in a diagram the Fact table sits in the middle and the Reference tables sitting all around it connected by lines so it would look like an star.
Snowflake Schema
You can have Reference tables which have Foreign Keys as well. So for example, it I have a Reference table of countries and a Reference table of Country Capital Cities, I can join my Fact table to the capital cities be going through the Country Reference table.
This is known as a Snowflake Schema because the Fact table sits in the middle and the lines come off it and then other lines come off them so the lines resemble a snowflake
Data Cleansing
Apart from joining tables, Foreign Keys also checks the data entered for valid values so are a form of data cleansing.
There are various reasons why data may be rejected as invalid
Possible reason | Resolution |
---|---|
There is a value which is completely invalid like an ISO3 Code of “ZZZ” | This is invalid so they will need to correct it |
The value is in another language or is an alternative spelling (for example “Suisse” instead of “Switzerland”) | It’s valid but in a different language. Use a different column for the lookup or use a Synonym table |
There is a spelling mistake on the data source (for example “Siwtzerland” instead of “Switzerland”) | If the data comes from a user via Excel or CSV then it is best to correct it at source. However, it is not always possible to do this if the data comes from an extarnal source like another agengy. In which case, it is best to use a synonym table |
Synonym tables
A Synonam table is a table in the data load pipeline to interpret what comes in to what you were hoping for.
Many-to-Many Relationships
The Problem
Imagine there is a situation where there are 3 members of a team; Chris Diakos, Chris Faulkner and Chris Tantillo.
There are 3 tasks which require them to work in pairs; Task 1, Task 2 and Task 3.
They have each been assigned 2 tasks and they always work with someone different.
Chris Diakos is working on Task 1 and Task 2 Chris Faulkner is working on Task 2 and Task 3 Chris Tantillo is working on Task 3 and Task 1
The relationship looks like this
This is called a many-to-many relationship.
How do you represent this in a database?
Link Tables
As the name suggests, a Link table is used as an intermediary table between two other tables and is used to implement many-to-many relationships. It consists of 2 foreign keys pointing to the referenced records. Both of the fields are the primary key to prevent duplication.
Arrows from the Person and Task tables to the Link table illustrate how the codes connect through this intermediary table, allowing each person to be associated with multiple tasks and vice versa. This structure is a classic “many-to-many” relationship model, which is commonly used in database design.
Primary Keys
A Primary Key is a unique identifier for each record in a database table, ensuring that each row can be distinctly identified. It consists of one or more columns, and the values in these columns must be unique across the entire table. Primary Keys are essential for establishing relationships between tables, as they can be referenced by foreign keys in other tables to create links. They improve data integrity, allowing databases to quickly retrieve, update, or delete records.
Whereas in a standard database Primary Keys are mandatory, xMart allows them to be set up as optional.
They are displayed like this with a Key in the Is Key field to indicate that the field is in the BPK and a red star next to the field code to indicate that it is mandatory