
A is for Algorithm

So, how do we actually match our data to the watchlist data? There are really two components to this: the mechanism by which we match, and what we actually match against. The matching mechanism ultimately boils down to some flavor of text search and matching technology.

In broad strokes, there are three kinds of matching algorithms: exact matching, fuzzy matching and phonetic matching. They each have their pros and cons.

Exact matching is just what it sounds like: match the tokens (like words, but also strings of numbers and mixed alphanumeric items like passport numbers or SWIFT addresses) in the data being screened to the tokens derived from the regulatory list. A prominent exact matching methodology is known colloquially as the Clearing House Algorithm; it was developed back in the 1980s by The Clearing House, which runs the CHIPS wholesale funds transfer network and the EPN automated clearing house (ACH).

The good news about exact matching is twofold: it's fast, and it has the lowest rate of false positives (matches to data that isn't actually the listed person or entity) of the various technologies. The bad news is that it has limited ability to catch common typographical errors.
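
To make that concrete, here is a minimal sketch of exact token matching in Python. This is a generic illustration, not the Clearing House Algorithm itself; the helper names and watchlist entries are hypothetical, and a real engine screens against tokenized OFAC, UN or EU list feeds with far more edge-case handling:

```python
# A toy exact-match screen: an item hits if every token of a
# watchlist entry appears verbatim in the screened text.
def tokens(text: str) -> set[str]:
    return set(text.upper().split())

# Hypothetical entries; real ones come from regulatory list feeds.
WATCHLIST = [tokens("MARIA HERNANDEZ"), tokens("JOSE RODRIGUEZ")]

def exact_hit(message: str) -> bool:
    msg_tokens = tokens(message)
    return any(entry <= msg_tokens for entry in WATCHLIST)  # subset test

print(exact_hit("WIRE TO MARIA HERNANDEZ"))  # True
print(exact_hit("WIRE TO MARIA HERNANDES"))  # False: one typo defeats it
```

The second lookup shows the weakness described above: change one letter and the exact matcher is blind.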

Fuzzy matching (or fuzzy search) is similar to exact matching, but will also match data that is "close" to the regulatory data. The quality threshold, which determines how good a match must be for an item to need review, is usually tunable. The result of the fuzzy matching algorithm is often called a "score," which we'll use here for convenience. Although it's not a precise way of looking at it, the threshold is roughly the percentage of the token that is properly matched. Generally, fuzzy matching relies on a concept called "edit distance," a way of counting how many changes would have to be made to one token for it to be identical to another. Popular edit distance measures include Levenshtein distance and Hamming distance (though Hamming only counts substitutions between equal-length strings). So, JONES and JANE have an edit distance of 2: one substitution between the O and the A, plus the extra (or missing, depending on your perspective) S. Using one possible calculation, the resulting match scores roughly (5 - 2) / 5 = 60%, where 5 is the length of JONES and 2 is the edit distance.
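
Here is a minimal sketch of Levenshtein distance plus the score calculation above. The normalization (dividing by the listed token's length) is just this post's convenience formula, not a standard every vendor follows:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance
    (insertions, deletions and substitutions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def score(candidate: str, listed: str) -> float:
    """(length - edit distance) / length, per the example above."""
    return (len(listed) - levenshtein(candidate, listed)) / len(listed)

print(levenshtein("JONES", "JANE"))  # 2
print(score("JANE", "JONES"))        # 0.6
```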

The big advantage of fuzzy search is that it permits firms with higher risk profiles to cast wider nets and find matches despite a certain level of typos. The disadvantages are equally significant: fuzzy search is slower, and it can generate a significantly larger number of matches (part of why it's slower).

In fact, fuzzy matching generates what's known as "hockey stick data." At high matching thresholds, the number of additional matches (compared to exact matching) is small and doesn't change much for small changes to the threshold, like the blade of a hockey stick. Imagine, for example, a threshold of 98%: under the score calculation above, even a single-character error only clears that bar on text at least 50 characters long, so it's hard to score that high without matching a very long phrase.

On the other hand, at some point the curve inflects and starts rising rapidly, like the shaft of a hockey stick. When does that happen? Once the threshold drops low enough that shorter, more common name lengths start to qualify, the number of matches rises commensurately. So a 91% threshold will generally let you find single-character errors in 12-character data, and 90% will include 10- and 11-character data. Since you typically match multiple words (like the first and last names of individual terrorists or drug traffickers), that's not totally unreasonable: common Hispanic surnames like HERNANDEZ (9 characters), GONZALEZ (8 characters) and RODRIGUEZ (9 characters), coupled with common first names like MARIA or JOSE, can generate fuzzy matches at 91% and, in some cases, higher thresholds.
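
Those numbers fall straight out of the score formula: a single-character error scores (L - 1) / L on text of length L, so it survives a threshold only when L is at least 1 / (1 - threshold). A quick check, using integer arithmetic to avoid floating-point surprises (this assumes the convenience scoring used throughout this post):

```python
def min_length_for_single_typo(threshold_pct: int) -> int:
    """Smallest length L with (L - 1) / L >= threshold_pct / 100."""
    gap = 100 - threshold_pct
    return (100 + gap - 1) // gap  # ceiling division in exact integer math

for pct in (98, 95, 91, 90):
    print(f"{pct}%: single typo survives at length >= "
          f"{min_length_for_single_typo(pct)}")
# 98%: 50   95%: 20   91%: 12   90%: 10
```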

Phonetic matching algorithms match words that sound alike. There are quite a few phonetic algorithms in use, the most popular being Soundex, Metaphone and Double Metaphone. These algorithms are useful when the assumption is that the information was heard rather than seen. There are a number of downsides: these algorithms are not designed to catch typos, and, because they are based on the pronunciation of words, they are language- and accent-specific. For example, Daitch-Mokotoff Soundex is best for names of Slavic or Germanic heritage, while Caverphone is optimized for New Zealand accents.
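
For flavor, here is a minimal sketch of classic American Soundex, the simplest member of the family (the variants named above are considerably more elaborate):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue              # H and W do not break a run of like codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code               # vowels (code == "") reset the run
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))        # R163 R163
print(soundex("Catherine"), soundex("Katherine"))  # C365 K365
```

Note the last pair: two names that sound identical still get different codes, because Soundex keeps the first letter verbatim. That's one illustration of why these algorithms can miss things a human listener would not.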

This is only part of the story, of course – there are many options that make these algorithms more or less effective. But that’s the subject of another post – this one’s long enough as is.
