Quality Systems Services
Copyright (c) Quality Systems Services 1995 - 2019
Name Data - The Use and Importance of Names
Name data has many uses in business, finding applications in:
- Marketing
- De-duplication
- Data cleansing
Accurate name data is important to both your business and your customers. For your business it represents essential information for marketing and communication with individual customers, and a valuable independent information source for de-duplication of your databases. For your customers, a correctly spelled name on a mailing represents a high level of personal attention and competence from your business, whereas a misspelled name inspires no confidence in your business’s care towards the customer or its attention to detail. A mailing to an inappropriate name, for example the name of a deceased person, or a spoof name maliciously supplied, may cause offence to the customer.
What's in a Name?
Names are a rich source of information. In the simplest case we might extract from a name the following components:
- Titles
- Forenames
- Initials
- Surnames
This information is typically explicitly represented in the name. In addition, with appropriate processing we might extract more detailed personal information about an individual:
- Gender, e.g. Miss J. Smith
- Marital Status, e.g. Mrs J. Smith
- Qualifications, e.g. Dr. J. Smith PhD
- Sensitive Information, e.g. Mr J. Smith (deceased)
- Inappropriate Information, e.g. Mr Donald Duck
Representation of Name Data
We can see that a strategy for utilising name data to its full potential should have several properties: it must provide easy access to both the explicit information in the name and implicit information such as gender, which can be useful for tasks such as de-duplication; and it must be able to manipulate the name data flexibly to meet the varying needs of tasks including indexing, de-duplication, addressing, and salutations. For example, the name 'John Smith' might optimally be represented in several forms depending on the task:
- De-duplication: 'MR J SMITH' or 'JOHN SMITH'
- Addressing: 'Mr. J. Smith'
- Salutation: 'Dear Mr. Smith'
The clear choice for representation is an explicit field format in which we have fields for title, initials, forenames, surname etc. This representation gives maximum flexibility to manipulate the name for the variety of uses to which it may be put and makes algorithms for tasks such as the generation of correct salutations straightforward.
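As a minimal sketch of why an explicit field format makes these tasks straightforward, the Python below holds a name in separate fields and derives the de-duplication, addressing and salutation forms from them. The class name `PersonName`, its field names and its methods are illustrative assumptions, not part of any QSS product.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Explicit field format for a personal name (illustrative field names)."""
    title: str = ""
    forenames: str = ""
    initials: str = ""
    surname: str = ""

    def addressing(self) -> str:
        # Title, then initials (or forenames if no initials are held), then surname.
        given = self.initials or self.forenames
        return " ".join(part for part in (self.title, given, self.surname) if part)

    def salutation(self) -> str:
        # Fall back to the forenames when no title is held.
        lead = self.title or self.forenames
        return " ".join(part for part in ("Dear", lead, self.surname) if part)

    def dedup_key(self) -> str:
        # Upper-case form with full stops stripped, for matching records.
        raw = " ".join(part for part in (self.title, self.initials, self.surname) if part)
        return raw.replace(".", "").upper()


name = PersonName(title="Mr.", forenames="John", initials="J.", surname="Smith")
print(name.dedup_key())   # MR J SMITH
print(name.addressing())  # Mr. J. Smith
print(name.salutation())  # Dear Mr. Smith
```

Each output form is a simple function of the fields, so adding a new form does not require re-parsing the original name.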
Acquisition of Name Data
Often when acquiring personal data from disparate data sources we find that a variety of name formats have been adopted. In more mature data sources we would expect to find some variation of a field-based format, which may or may not be as complete as is desirable. For example, we might find a name such as 'Mr. John H. Smith' represented in various ways:
| Title | Forename | Initials | Surname | Full Initials |
| Mr. | John | H. | Smith | J. H. |
| Mr | John | J H | Smith |  |
| MR | JOHN | H | SMITH |  |
Variability may be encountered in conventions of punctuation, casing, completeness of fields, assignment of data to fields etc. Because data is incorporated from multiple sources and by multiple data capture operators, we may find much inconsistency within a single data set, and more still across multiple data sets. Clearly when consolidating such data it is desirable to adopt a single comprehensive format, standardising and validating all data against this format.
Free Format Names
A more difficult situation which unfortunately is often encountered is the representation of an individual’s name in a single 'free format name' field. In this format we may encounter a wide variety of name forms, for example:
- Dr J McDonald
- Dr. J. McDonald
- Doctor John Mc Donald
- Doctor John Mc-Donald
- Dr. J. McDonald-Smythe
- Dr. and Mrs. McDonald
- Dr J / Mrs L McDonald
- Dr J. McDonald PhD.
- Dr J McDonald (deceased)
Here we see just a few sources of variability: capitalisation, punctuation, spacing, alternative forms for titles, hyphenation, fields representing more than one person, and additional information such as qualifications and annotations. Processing such free format data clearly requires very flexible processing techniques. Traditional approaches to processing such data have been naïve and applied in an ad hoc fashion on an individual dataset basis, for example by assuming that the first word is the title, the last is the surname, and so on. Such simple processing is unpredictable and has many limitations: it cannot cope reliably with data in which the free format varies, cannot recognise spelling mistakes or non-name data, and so on. Often one encounters field-formatted names in which errors have clearly been introduced by inadequate techniques for conversion from free format names.
Corrupted and Non-Name Data
Additional problems are presented by errors in the name data which may have originated at the data entry stage or by inadequate data consolidation techniques, for example:
- Misspelled data, e.g. Mis Smith
- Corrupted data, e.g. XX Mr Smith
Other problems often found include:
- Company names, e.g. 'Dixons The Butchers'
- Non-name data, e.g. '17 Smith Street'
- Undesirable data, e.g. 'Mr F. Flintstone'
Identification of non-name or undesirable data and correction of misspelled or corrupted names is clearly a very important task. Such processing has, however, been far beyond the abilities of conventional name processing systems.
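To make the idea of detecting non-name data concrete, here is a small Python sketch that flags address lines, company names and deceased annotations using keyword lists. The word lists, the function name `classify_field` and the rules are assumptions chosen for illustration. A production system would need far larger lists and more careful rules; for instance, this sketch would misclassify a person whose surname happens to be a company keyword.

```python
# Tiny illustrative word lists (assumptions); real lists would be much larger.
COMPANY_WORDS = {"ltd", "limited", "plc", "co-op", "foods", "appliances", "butchers"}
ADDRESS_WORDS = {"street", "road", "avenue", "lane"}
DECEASED_WORDS = {"deceased"}


def classify_field(text: str) -> str:
    """Return 'address', 'company', 'deceased' or 'name' for a raw name field."""
    words = [w.strip(".,()'").lower() for w in text.split()]
    # A leading house number followed by a street word suggests an address line.
    if words and words[0].isdigit() and ADDRESS_WORDS & set(words):
        return "address"
    if COMPANY_WORDS & set(words):
        return "company"
    if DECEASED_WORDS & set(words):
        return "deceased"
    return "name"


for field in ["17 Smith Street", "Dixons The Butchers", "Mr J Smith (deceased)", "Mr. J. Smith"]:
    print(field, "->", classify_field(field))
```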
NameBase
To address the needs for flexible name processing outlined above, QSS has developed an entirely new name processing system, NameBase. The system has been custom designed, incorporating innovative data representations and processing techniques. NameBase processes both field-formatted and free format name data from any database source, with capabilities including:
- Validation of field-formatted names
- Conversion of free format names to field format
  - Flexible user-defined field format
  - Standardise capitalisation, punctuation, hyphenation etc.
  - Expand or abbreviate titles
- Spelling and punctuation correction
  - Automatic insertion/correction of punctuation
  - Automatic correction of spelling
  - Suggestions for ambiguous spelling correction
- Detection of non-name data
  - Company names
  - Addresses, delivery notes etc.
  - Undesirable, deceased, spoof names
- Creation of value-added information
  - Full initials from forenames
  - Determination of gender
  - Gender-correct default titles for salutation etc.
  - Reason for failing name
  - Type of editing applied to correct name
Example Output
Perhaps the most compelling demonstration of the capabilities of NameBase is to examine some example output. The tables below show a few examples of free format name fields taken from real databases and their processing by NameBase. In this case the task was to reformat free format names into separate fields, standardising casing and punctuation, abbreviating titles, extracting gender where possible, and generating a full set of initials for de-duplication. Name entries representing multiple people were to be split by NameBase into multiple records.
The 'Status' field encodes the status of processing for a record.
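The status codes used in the example tables in this section can be collected into a simple lookup, sketched here in Python. The expansions are those given in the accompanying text, except 'AMBIGUOUS' for code 'A', which is inferred from its description. The `needs_review` policy is a hypothetical illustration of how the codes might drive post-processing, not documented NameBase behaviour.

```python
from enum import Enum


class Status(Enum):
    E = "EXACT"
    EE = "EXACT with EDIT"
    A = "AMBIGUOUS"            # expansion inferred from the text
    I = "INEXACT"
    NA = "NO MATCH: ADDRESS"
    NC = "NO MATCH: COMPANY"
    ND = "NO MATCH: DECEASED"
    EC = "EXACT but COMPANY like"
    EU = "EXACT but UNDESIRABLE"


def needs_review(code: str) -> bool:
    """Hypothetical policy: only clean exact matches skip manual review."""
    return Status[code] not in (Status.E, Status.EE)


print(Status["NC"].value)   # NO MATCH: COMPANY
print(needs_review("EE"))   # False
print(needs_review("EU"))   # True
```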
These records show examples of exact name matches, with status 'E' (EXACT), with a variety of casing, spacing and punctuation, and nontrivial surname 'De Hutiray'.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 1 | MR D. ALLEN | E | 1 | Mr. |  | D. | Allen | M | D. |
| 2 | Mr A G B De Hutiray | E | 1 | Mr. |  | A. G. B. | De Hutiray | M | A. G. B. |
| 3 | Captain Frank Gurney | E | 1 | Capt. | Frank |  | Gurney | M | F. |
| 4 | A.Tottingham | E | 1 |  |  | A. | Tottingham | U | A. |
These show how NameBase has recognised a field representing two people, and split this into two records, assigning titles and initials correctly to the appropriate individuals. Other options supported include keeping both individuals as a single record with compound title field.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 5 | Mr M J & Mrs Davidson | E | 1 | Mr. |  | M. J. | Davidson | M | M. J. |
|  | Mr M J & Mrs Davidson | E | 2 | Mrs. |  |  | Davidson | F |  |
| 6 | Rev & Mrs Payne | E | 1 | Rev. |  |  | Payne | U |  |
|  | Rev & Mrs Payne | E | 2 | Mrs. |  |  | Payne | F |  |
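The splitting of a two-person field can be sketched in a few lines of Python. This is a deliberately simple illustration, not the NameBase method: it assumes each person starts with a title, the people are joined by '&' or 'and', and the shared surname is the final word. The function name `split_joint_name` and the record keys are hypothetical, and titles are kept as entered rather than standardised.

```python
import re


def split_joint_name(field: str) -> list[dict]:
    """Split e.g. 'Mr M J & Mrs Davidson' into one record per person."""
    if not field.strip():
        return []
    parts = re.split(r"\s+(?:&|and)\s+", field.strip(), flags=re.IGNORECASE)
    surname = parts[-1].split()[-1]
    people = []
    for number, part in enumerate(parts, start=1):
        tokens = part.split()
        if number == len(parts):
            tokens = tokens[:-1]  # the shared surname closes the last part
        title = tokens[0] if tokens else ""
        # Everything between title and surname is treated as initials.
        initials = " ".join(t.rstrip(".") + "." for t in tokens[1:])
        people.append({"person": number, "title": title,
                       "initials": initials, "surname": surname})
    return people


for record in split_joint_name("Mr M J & Mrs Davidson"):
    print(record)
```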
These records have status 'EE' (EXACT with EDIT) indicating that NameBase has successfully extracted a name after a simple editing of the field. Here are examples of corrupted entries, typing errors, and superfluous components e.g. 'Attn:'. In each case the correct name is extracted automatically.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 7 | A Mr TURNER | EE | 1 | Mr. |  |  | Turner | M |  |
| 8 | ,ISS C STYLES | EE | 1 | Miss |  | C. | Styles | F | C. |
| 9 | . Mrs Marsden | EE | 1 | Mrs. |  |  | Marsden | F |  |
| 10 | Attn: Chris Thompson | EE | 1 |  | Chris |  | Thompson | U | C. |
| 11 | Attn~: Mrs Shipley | EE | 1 | Mrs. |  |  | Shipley | F |  |
| 12 | C.O Mrs Robertshaw | EE | 1 | Mrs. |  |  | Robertshaw | F |  |
| 13 | C/O Mr Pearce | EE | 1 | Mr. |  |  | Pearce | M |  |
| 14 | Customer Miss Mead | EE | 1 | Miss |  |  | Mead | F |  |
This record shows an example of an ambiguous name field (status 'A'). In this case NameBase provides two interpretations of the entry 'MR C O TOOLE', either 'Mr. C. O. Toole', or 'Mr. C. O’Toole' since it has recognised that omission of punctuation from the input may have hidden the common surname 'O’Toole'. NameBase provides options to control how such ambiguities are processed, selecting a best guess or leaving the choice for manual post-processing.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 15 | MR C O TOOLE | A | 1 | Mr. |  | C. O. | Toole | M | C. O. |
|  | MR C O TOOLE | A | 1 | Mr. |  | C. | O'Toole | M | C. |
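One way to picture how such an ambiguity might be detected is sketched below in Python. When the last initial is 'O', the sketch also tries reading it as the prefix of an apostrophised surname and keeps every reading supported by a surname list. The set `KNOWN_SURNAMES` and the function `interpretations` are hypothetical stand-ins; how NameBase actually performs this step is not described in the text.

```python
# Hypothetical stand-in for a large surname database.
KNOWN_SURNAMES = {"Toole", "O'Toole", "Smith"}


def interpretations(initials: list[str], surname: str) -> list[tuple[list[str], str]]:
    """Return every (initials, surname) reading supported by the surname list."""
    readings = []
    if surname in KNOWN_SURNAMES:
        readings.append((initials, surname))
    # A trailing initial 'O' may really be the prefix of a surname such as O'Toole.
    if initials and initials[-1].upper() == "O":
        candidate = f"O'{surname}"
        if candidate in KNOWN_SURNAMES:
            readings.append((initials[:-1], candidate))
    return readings


readings = interpretations(["C", "O"], "Toole")
print(readings)  # [(['C', 'O'], 'Toole'), (['C'], "O'Toole")]
print("ambiguous" if len(readings) > 1 else "unambiguous")
```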
This record gives another ambiguous example, with status 'I' (INEXACT). In this case the surname 'Hutchson' appears to be a misspelling and NameBase suggests six corrections to the spelling.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 16 | E W Hutchson | I | 1 |  |  | E. W. | Hutchason | U | E. W. |
|  | E W Hutchson | I | 1 |  |  | E. W. | Hutcheon | U | E. W. |
|  | E W Hutchson | I | 1 |  |  | E. W. | Hutcheson | U | E. W. |
|  | E W Hutchson | I | 1 |  |  | E. W. | Hutchison | U | E. W. |
|  | E W Hutchson | I | 1 |  |  | E. W. | Hutchon | U | E. W. |
|  | E W Hutchson | I | 1 |  |  | E. W. | Hutchson | U | E. W. |
These records show examples of inputs which have been rejected by NameBase. Record 17 is an example of an address line (status 'NA' - NO MATCH: ADDRESS), records 18-21 are correctly failed as company names (status 'NC' - NO MATCH: COMPANY) and record 22 is given status 'ND' - NO MATCH: DECEASED indicating that the name refers to a deceased individual.
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 17 | 165 AINSLIE STREET | NA | 1 |  | Ainslie |  | Street | U | A. |
|  | 165 AINSLIE STREET | NA | 1 |  |  |  | Ainslie Street | U |  |
| 18 | BEVAN FUNNELL LTD | NC | 1 |  | Bevan |  | Funnell | M | B. |
|  | BEVAN FUNNELL LTD | NC | 1 |  |  |  | Bevan Funnell | U |  |
| 19 | Vale Royal Fresh Foods | NC | 1 |  | Vale |  | Royal Fresh | U | V. |
|  | Vale Royal Fresh Foods | NC | 1 |  | Vale Royal |  | Fresh | U | V. R. |
|  | Vale Royal Fresh Foods | NC | 1 |  |  |  | Vale Royal Fresh | U |  |
|  | Vale Royal Fresh Foods | NC | 1 |  | Royal |  | Vale Fresh | U | R. |
| 20 | Aaa Appliances | NC | 0 |  |  |  |  | U |  |
| 21 | Anglia Co-OP | NC | 1 |  | Anglia |  |  | F | A. |
| 22 | Mr J Smith deceased | ND | 1 | Mr. |  | J. | Smith | M | J. |
These records give examples of entries which match exactly as names but are identified by NameBase as 'suspicious', record 23 (status 'EC' - EXACT but COMPANY like) because it looks like a company name, and record 24 (status 'EU' - EXACT but UNDESIRABLE) because while being a valid name it is potentially undesirable (Mr. Mickey Mouse?).
| ID | Name | Status | Person | Title | Forenames | Initials | Surname | Gender | Full Initials |
| 23 | Anglia Coop | EC | 1 |  | Anglia |  | Coop | F | A. |
| 24 | Mr M Mouse | EU | 1 | Mr. |  | M. | Mouse | M | M. |
NameBase Technology
These tables of example results should demonstrate clearly the flexibility with which NameBase is able to treat name data, operating correctly on complex multi-person names in the presence of typing errors, misspelling, incorrect formatting of input fields, and many other errors encountered in real world databases. The key to the system’s accuracy and robustness is the specific system design adopted by QSS and unique to NameBase. Key features include:
- Very large database of categorised name components
- Formal model of valid name forms
- Formal error correction rules
- System for explaining and auditing matches
- Open architecture
At the heart of NameBase lie two main components - a very large database of categorised name components, and an advanced pattern matching engine using formal specifications of valid name forms and error correction strategies.
Name Component Database
The database contains not only conventional name components such as forenames and surnames, but also surname prefixes, alternative title forms, company name indicators, etc., all annotated with information including classification, gender and frequency. Using such a large database gives NameBase a unique lead over conventional name processing, which has tended to rely on lists of a few tens or hundreds of name components. A proprietary data format allows lightning-speed access to the database while supporting ultra-efficient searches for misspelled entries.
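The proprietary format itself is not described, but the idea of searching a component list for likely corrections of a misspelled entry can be illustrated with Python's standard `difflib` module. This is a generic similarity search swapped in for illustration, not the NameBase technique. The short `SURNAMES` list, the function name `suggest_surnames` and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches

# Tiny illustrative list; a real component database would be very much larger.
SURNAMES = ["Hutchason", "Hutcheon", "Hutcheson", "Hutchison", "Hutchon",
            "Smith", "Turner"]


def suggest_surnames(word: str, limit: int = 6, cutoff: float = 0.8) -> list[str]:
    """Return up to `limit` known surnames that closely resemble `word`."""
    return get_close_matches(word, SURNAMES, n=limit, cutoff=cutoff)


print(suggest_surnames("Hutchson"))
```

With this list, the query 'Hutchson' returns the five 'Hutch...' surnames and none of the unrelated ones. A linear scan like this would not scale to a very large database, which is presumably where a specialised data format earns its keep.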
Pattern Matching Engine
Utilising the database is an advanced pattern matching engine which uses formal specifications of valid name forms to interpret input fields, constituting a well-defined model of what is and isn’t a valid name. The engine in addition uses formally defined error correction strategies to correct errors in input, reinterpret ambiguous name components or edit input to achieve an interpretation. Whereas conventional approaches to name processing have used hard-coded procedures for processing input, the use by NameBase of formally-defined techniques allows the system to explain and justify editing decisions made during processing. Conventional systems have typically only been able to output a “best guess”; NameBase by contrast outputs a choice of interpretations if the input is ambiguous, status codes indicating reasons for failure, success, or ambiguity of the input, and can output detailed information of the editing performed to match a corrupted input so that intelligent selective post-processing of the results may be carried out to ensure maximal accuracy.
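A toy Python version of matching against a formal specification is sketched below. Each token is assigned every category it could belong to, and a candidate reading is accepted only if its category sequence matches a declared valid form. The component sets, the category letters and the single `VALID_FORM` pattern are assumptions for illustration; the actual NameBase specifications are not given in the text.

```python
import re
from itertools import product

# Hypothetical miniature component lists.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "captain"}
FORENAMES = {"john", "frank", "chris"}
SURNAMES = {"smith", "gurney", "allen", "frank"}

# Formal model of a valid name: optional Title, then initials or one Forename, then Surname.
VALID_FORM = re.compile(r"T?(?:I+|F)?S")


def categories(token: str) -> list[str]:
    """Return every category letter the token could take."""
    word = token.strip(".").lower()
    found = []
    if word in TITLES:
        found.append("T")
    if len(word) == 1 and word.isalpha():
        found.append("I")
    if word in FORENAMES:
        found.append("F")
    if word in SURNAMES:
        found.append("S")
    return found


def interpret(field: str) -> list[str]:
    """Return the category sequences under which the field is a valid name."""
    options = [categories(token) for token in field.split()]
    readings = []
    for combo in product(*options):
        form = "".join(combo)
        if VALID_FORM.fullmatch(form):
            readings.append(form)
    return readings


print(interpret("Captain Frank Gurney"))  # ['TFS']
print(interpret("165 Ainslie Street"))    # []
```

Because the valid forms are declared as data rather than buried in procedural code, an empty result means no match, a single result is an exact reading, and more than one result signals ambiguity that can be reported rather than silently guessed.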
Open Architecture and Data-centric Approach
A third key element of the design of NameBase is its open architecture. The system is not tied to any particular platform, giving maximum flexibility for incorporation into your existing system environments. An important aspect is that the system’s approach is essentially data-centric. Output data from the system contains rich information for manual post-processing, meaning that such processing can be carried out on existing database terminals with low requirements in terms of processing power. Incorporation into complex multi-user environments becomes a simple task supported by the existing database infrastructure.