Typical use case

You have case data with free-text subnational administrative place names. Each case record has these fields: Admin0Raw (ex: Country), Admin1Raw (ex: Province), Admin2Raw (ex: District).

You need to validate and match the freetext names with a reference table of geoplaces. Let’s call the reference table for geography REF_GEOPLACES. We will first show how to do a hierarchy lookup directly against the REF_GEOPLACES table, and then we will show how to use a synonym table to translate any freetext spelling variations to the reference table.

Example Table CASES (your input data)

CASE_ID Admin0Raw Admin1Raw Admin2Raw
123 BEN ATACORA Kouandé
678 BEN DONGA Bassila
879 BEN DonGA Unknown

Table REF_GEOPLACES

Typical geography reference table

PlaceID Admin0 Admin1 Admin2 GeoLevel
27d0edf7 BEN     2
741b3ef2 BEN ATACORA   3
b87f5228 BEN ATACORA Kouandé 4
25cb5755 BEN DONGA   3
7ad5511f BEN DONGA Bassila 4

The goal is to create a new PlaceID column in the input CASES table to hold the value from the matching PlaceID column from the REF_GEOPLACES table. Any failed lookup will be reported as a batch issue.

Note that matching is case insensitive.

Here is how to write such a lookup command. “Source” refers to the input CASES table. “Lookup” refers to the reference REF_GEOPLACES table.

  <Transform>
    <HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
      <Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
        <Lookup Source="Admin0Raw" Lookup="Admin0" LookupFilter="GeoLevel = 2" />
        <Lookup Source="Admin1Raw" Lookup="Admin1" LookupFilter="GeoLevel = 3"  />
        <Lookup Source="Admin2Raw" Lookup="Admin2" LookupFilter="GeoLevel = 4"  />
      </Lookups>
    </HierarchyLookup>
  </Transform>

The order of the <Lookup> elements is defining the actual hierarchy of parent - child relationships. They must be ordered from the higher level to the lower one.

After resolution

In this example, we only requested the final PlaceID of the resolved geography (see last column at the end).

CASE_ID Admin0Raw Admin1Raw Admin2Raw PlaceID
123 Ben Atacora Kouandé b87f5228 (PlaceID of resolved admin2)
123 BEN DonGA Bassila 7ad5511f (PlaceID of resolved admin2)
123 BEN DonGA Unknown 25cb5755 (admin3 not resolved so PlaceID of resolved admin1)

The command is also able to add a new column for each level of lookup that has been resolved. In each Lookup command, indicate which column from the reference table (LookupResult) should be put into which column of the cases table (SourceResult), as follows:

  <Transform>
    <HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
      <Lookups>
        <Lookup Source="Admin0Raw" Lookup="Admin0" LookupFilter="GeoLevel = 2" LookupResult="Admin0Name" SourceResult="ResolvedAdmin0" />
        <Lookup Source="Admin1Raw" Lookup="Admin1" LookupFilter="GeoLevel = 3" LookupResult="Admin1Name" SourceResult="ResolvedAdmin1" />
        <Lookup Source="Admin2Raw" Lookup="Admin2" LookupFilter="GeoLevel = 4" LookupResult="Admin2Name" SourceResult="ResolvedAdmin2" />
      </Lookups>
    </HierarchyLookup>
  </Transform>
CASE_ID Admin0Raw Admin1Raw Admin2Raw ResolvedAdmin0 ResolvedAdmin1 ResolvedAdmin2
123 BEN Atacora Kouandé BEN ATACORA Kouandé
123 BEN Donga Bassila BEN DONGA Bassila
123 BEN DonGA Unknown BEN DONGA (null)

All LookupResult and SourceResult in <Lookups> or <Lookup> are optional.

Support for Synonyms

The HierarchyLookup command can be configured to use a synonym table. A table of synonyms maps spelling variations in the source to the to fixed values in the lookup table.

Table REF_SYNONYMS example data

Synonym Reference Category
Bénin BEN Geoplaces
Benin BEN Geoplaces

In this example, both “Benin” and “Bénin” map to ISO3 code BEN. The value “BEN” must be in the lookup reference table. The synonym table is used in the HierarchyLookup command like this:

  <Transform>
    <HierarchyLookup LookupTable="REF_GEOPLACES" RegisterMissingAsIssues="true">
      <SynonymTable Code="REF_GEOSYNONYMS" FromColumn="Synonym" ToColumn="Reference" Filter="Category = 'Geoplaces'" />
      <Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
        <Lookup Source="Country"  Lookup="Admin0" LookupFilter="GeoLevel = 2" />
        <Lookup Source="Province" Lookup="Admin1" LookupFilter="GeoLevel = 3" />
        <Lookup Source="District" Lookup="Admin2" LookupFilter="GeoLevel = 4" />
      </Lookups>
    </HierarchyLookup>
  </Transform>

Note that the column names of the synonym table can be anything as they are configurable. Also, the “Category” column in the synonym table is optional but is useful if different kinds of synonyms for different purposes are stored in the same synonym table.

Table REF_SYNONYMS suggested data model

TABLE_CODE FIELD_CODE TITLE Description TYPE IS_REQUIRED IS_PRIMARY_KEY
REF_SYNONYMS SYNONYM Synonym some other variation of a name TEXT_450 TRUE TRUE
REF_SYNONYMS REFERENCE Reference the name as it appears in the reference table TEXT_450 TRUE TRUE
REF_SYNONYMS CATEGORY Category can be used as an optional filter if some synonyms should only apply to a specific domain TEXT_50 FALSE FALSE

For example, if there are mixed ISO3 and country names in the same columns. Synonyms for Country name to ISO3 can be added.

Synonyms disambiguation thanks to children

What happens if the same synonym maps to different reference values?

For example, consider the following raw input data. AUS is ambiguous, as it could refer either to Australia (city = Sydney) or to Austria (city = Innsbruck).

id country province city
333 AUS NSW Sydney
444 AUS Tyrol Innsbruck
555 AUS BOTH Innsbruck
666 AUS BOTH Sydney

The HierarchyLookup command is able to correctly resolve this using this synonym table:

  • Table REF_Synonyms
In Out Category
AUS Australia Geoplaces
AUS Austria Geoplaces

How does HierarchyLookup know that sometimes AUS should map to Australia and sometimes map to Austria? It knows because synonyms are only resolved in the context of its parent and child values. In other words, HierarchyLookup doesn’t look at each level in isolation of the other; it looks at the combined result across all levels.

Example reference table Table REF_PLACES

REF_Id REF_Country REF_Province REF_City REF_Level
12 Swiss     1
20 Australia     1
8 Austria     1
201 Australia NSW   2
2010 Australia NSW Sydney 3
801 Austria Tyrol   2
8010 Austria Tyrol Innsbruck 3

In the reference table, only Australia has values NSW and Sydney and only Austria has values Tyrol and Inssbruck. During the lookup, the combination of these values will determine uniqueness of the result.

If a regular, non-hierarchical lookup (1 level only) is set up to only resolve admin level 0 (country), it won’t be able to resolve the country AUS as it would match multiple entries due to synonyms for AUS having 2 possible outcomes.

But by doing a lookup on admin level 0 and also admin level 1 and also admin level 2, HierarchyLookup will use the deeper level to resolve the upper one.

<Lookups LookupResult="ResolvedId" SourceResult="PlaceID">
  <Lookup Source="Country"  Lookup="REF_Country" LookupFilter="REF_Level = 1" />
  <Lookup Source="Province" Lookup="REF_Province" LookupFilter="REF_Level = 2" />
  <Lookup Source="District" Lookup="REF_City" LookupFilter="REF_Level = 3" />
</Lookups>

Here is how the matching process will look like.

  1. For row with id=333, trying to match first Lookup Country: input data = 'AUS'
  2. Augment with synonyms found for In = 'AUS' -> ('Austria', 'Australia')
  3. Found 2 entries in REF_PLACES for condition REF_Country in ('AUS', 'Austria', 'Australia') and REF_Level = 1, so no match yet.
  4. Go to next level Lookup Province, input data = 'NSW'
  5. Augment with synonyms found for In = 'NSW' -> ()
  6. Found 1 single entry in REF_PLACES for condition REF_Province in ('NSW') and REF_Level = 1, so match. ✅
  7. If match, go back to parent lookup and resolve to NSW’s Ref_Country -> ‘Australia’

Resolution of missing parents

Similarly to above disamiguation of missing parents. If input data contains some row like

id country province
333 (empty) NSW

But the reference REF_PLACES contains a unique entry for REF_Province = 'NSW' and REF_Level = 2. The HierarchyLookup will be able to resolve the country of this row to ‘Australia’.

For this resolution of empty values to be enabled, the attribute <Lookups TryIfEmptyParent="true" /> needs to be set to true. By default, it’s set to false so won’t try to resolve such rows.

Output table

id country province resolvedCountry resolvedProvince
333 (empty) NSW Australia NSW