Typical use case
You have case data with free-text subnational administrative place names. Each case record has these fields: Admin0Raw (ex: Country), Admin1Raw (ex: Province), Admin2Raw (ex: District).
You need to validate and match the freetext names with a reference table of geoplaces. Let’s call the reference table for geography REF_GEOPLACES. We will first show how to do a hierarchy lookup directly against the REF_GEOPLACES table, and then we will show how to use a synonym table to translate any freetext spelling variations to the reference table.
Example Table CASES (your input data)
CASE_ID | Admin0Raw | Admin1Raw | Admin2Raw |
---|---|---|---|
123 | BEN | ATACORA | Kouandé |
678 | BEN | DONGA | Bassila |
879 | BEN | DonGA | Unknown |
Table REF_GEOPLACES
Typical geography reference table
PlaceID | Admin0 | Admin1 | Admin2 | GeoLevel |
---|---|---|---|---|
27d0edf7 | BEN | 2 | ||
741b3ef2 | BEN | ATACORA | 3 | |
b87f5228 | BEN | ATACORA | Kouandé | 4 |
25cb5755 | BEN | DONGA | 3 | |
7ad5511f | BEN | DONGA | Bassila | 4 |
The goal is to create a new PlaceID column in the input CASES table to hold the value from the matching PlaceID column from the REF_GEOPLACES table. Any failed lookup will be reported as a batch issue.
Note that matching is case insensitive.
Here is how to write such a lookup command. “Source” refers to the input CASES table. “Lookup” refers to the reference REF_GEOPLACES table.
<Transform>
<HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
<Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
<Lookup Source="Admin0Raw" Lookup="Admin0" LookupFilter="GeoLevel = 2" />
<Lookup Source="Admin1Raw" Lookup="Admin1" LookupFilter="GeoLevel = 3" />
<Lookup Source="Admin2Raw" Lookup="Admin2" LookupFilter="GeoLevel = 4" />
</Lookups>
</HierarchyLookup>
</Transform>
The order of the <Lookup>
elements is defining the actual hierarchy of parent - child relationships. They must be ordered from the higher level to the lower one.
After resolution
In this example, we only requested the final PlaceID of the resolved geography (see last column at the end).
CASE_ID | Admin0Raw | Admin1Raw | Admin2Raw | PlaceID |
---|---|---|---|---|
123 | Ben | Atacora | Kouandé | b87f5228 (PlaceID of resolved admin2) |
123 | BEN | DonGA | Bassila | 7ad5511f (PlaceID of resolved admin2) |
123 | BEN | DonGA | Unknown | 25cb5755 (admin3 not resolved so PlaceID of resolved admin1) |
The command is also able to add a new column for each level of lookup that has been resolved. In each Lookup command, indicate which column from the reference table (LookupResult) should be put into which column of the cases table (SourceResult), as follows:
<Transform>
<HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
<Lookups>
<Lookup Source="Admin0Raw" Lookup="Admin0" LookupFilter="GeoLevel = 2" LookupResult="Admin0Name" SourceResult="ResolvedAdmin0" />
<Lookup Source="Admin1Raw" Lookup="Admin1" LookupFilter="GeoLevel = 3" LookupResult="Admin1Name" SourceResult="ResolvedAdmin1" />
<Lookup Source="Admin2Raw" Lookup="Admin2" LookupFilter="GeoLevel = 4" LookupResult="Admin2Name" SourceResult="ResolvedAdmin2" />
</Lookups>
</HierarchyLookup>
</Transform>
CASE_ID | Admin0Raw | Admin1Raw | Admin2Raw | ResolvedAdmin0 | ResolvedAdmin1 | ResolvedAdmin2 |
---|---|---|---|---|---|---|
123 | BEN | Atacora | Kouandé | BEN | ATACORA | Kouandé |
123 | BEN | Donga | Bassila | BEN | DONGA | Bassila |
123 | BEN | DonGA | Unknown | BEN | DONGA | (null) |
All LookupResult and SourceResult in <Lookups>
or <Lookup>
are optional.
Support for Synonyms
The HierarchyLookup command can be configured to use a synonym table. A table of synonyms maps spelling variations in the source to the to fixed values in the lookup table.
Table REF_SYNONYMS example data
Synonym | Reference | Category |
---|---|---|
Bénin | BEN | Geoplaces |
Benin | BEN | Geoplaces |
In this example, both “Benin” and “Bénin” map to ISO3 code BEN. The value “BEN” must be in the lookup reference table. The synonym table is used in the HierarchyLookup command like this:
<Transform>
<HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
<SynonymTable Code="REF_GEOSYNONYMS" FromColumn="Synonym" ToColumn="Reference" Filter="Category = 'Geoplaces'" />
<Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
<Lookup Source="Country" Lookup="Admin0" LookupFilter="GeoLevel = 2" />
<Lookup Source="Province" Lookup="Admin1" LookupFilter="GeoLevel = 3" />
<Lookup Source="District" Lookup="Admin2" LookupFilter="GeoLevel = 4" />
</Lookups>
</HierarchyLookup>
</Transform>
Note that the column names of the synonym table can be anything as they are configurable. Also, the “Category” column in the synonym table is optional but is useful if different kinds of synonyms for different purposes are stored in the same synonym table.
Table REF_SYNONYMS suggested data model
TABLE_CODE | FIELD_CODE | TITLE | Description | TYPE | IS_REQUIRED | IS_PRIMARY_KEY |
---|---|---|---|---|---|---|
REF_SYNONYMS | SYNONYM | Synonym | some other variation of a name | TEXT_450 | TRUE | TRUE |
REF_SYNONYMS | REFERENCE | Reference | the name as it appears in the reference table | TEXT_450 | TRUE | TRUE |
REF_SYNONYMS | CATEGORY | Category | can be used as an optional filter if some synonyms should only apply to a specific domain | TEXT_50 | FALSE | FALSE |
For example, if there are mixed ISO3 and country names in the same columns. Synonyms for Country name to ISO3 can be added.
Synonyms disambiguation thanks to children
What happens if the same synonym maps to different reference values?
For example, consider the following raw input data. AUS is ambiguous, as it could refer either to Australia (city = Sydney) or to Austria (city = Innsbruck).
id | country | province | city |
---|---|---|---|
333 | AUS | NSW | Sydney |
444 | AUS | Tyrol | Innsbruck |
555 | AUS | BOTH | Innsbruck |
666 | AUS | BOTH | Sydney |
The HierarchyLookup command is able to correctly resolve this using this synonym table:
- Table REF_Synonyms
In | Out | Category |
---|---|---|
AUS | Australia | Geoplaces |
AUS | Austria | Geoplaces |
How does HierarchyLookup know that sometimes AUS should map to Australia and sometimes map to Austria? It knows because synonyms are only resolved in the context of its parent and child values. In other words, HierarchyLookup doesn’t look at each level in isolation of the other; it looks at the combined result across all levels.
Example reference table Table REF_PLACES
REF_Id | REF_Country | REF_Province | REF_City | REF_Level |
---|---|---|---|---|
12 | Swiss | 1 | ||
20 | Australia | 1 | ||
8 | Austria | 1 | ||
201 | Australia | NSW | 2 | |
2010 | Australia | NSW | Sydney | 3 |
801 | Austria | Tyrol | 2 | |
8010 | Austria | Tyrol | Innsbruck | 3 |
In the reference table, only Australia has values NSW and Sydney and only Austria has values Tyrol and Inssbruck. During the lookup, the combination of these values will determine uniqueness of the result.
If a regular, non-hierarchical lookup (1 level only) is set up to only resolve admin level 0 (country), it won’t be able to resolve the country AUS as it would match multiple entries due to synonyms for AUS having 2 possible outcomes.
But by doing a lookup on admin level 0 and also admin level 1 and also admin level 2, HierarchyLookup will use the deeper level to resolve the upper one.
<Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
<Lookup Source="Country" Lookup="REF_Country" LookupFilter="REF_Level = 1" />
<Lookup Source="Province" Lookup="REF_Province" LookupFilter="REF_Level = 2" />
<Lookup Source="District" Lookup="REF_City" LookupFilter="REF_Level = 3" />
</Lookups>
Here is how the matching process will look like.
- For row with id=333, trying to match first Lookup Country:
input data = 'AUS'
- Augment with synonyms found for
In = 'AUS' -> ('Austria', 'Australia')
- Found 2 entries in REF_PLACES for condition
REF_Country in ('AUS', 'Austria', 'Australia') and REF_Level = 1
, so no match yet. - Go to next level Lookup Province,
input data = 'NSW'
- Augment with synonyms found for
In = 'NSW' -> ()
- Found 1 single entry in REF_PLACES for condition
REF_Province in ('NSW') and REF_Level = 1
, so match. ✅ - If match, go back to parent lookup and resolve to NSW’s Ref_Country -> ‘Australia’
Resolution of missing parents
Similarly to above disamiguation of missing parents. If input data contains some row like
id | country | province |
---|---|---|
333 | (empty) | NSW |
But the reference REF_PLACES contains a unique entry for REF_Province = 'NSW' and REF_Level = 2
.
The HierarchyLookup will be able to resolve the country of this row to ‘Australia’.
For this resolution of empty values to be enabled, the attribute <Lookups TryIfEmptyParent="true" />
needs to be set to true. By default, it’s set to false so won’t try to resolve such rows.
Output table
id | country | province | resolvedCountry | resolvedProvince |
---|---|---|---|---|
333 | (empty) | NSW | Australia | NSW |