
Building Similar Entity Recognizers

By Arthi Venkataraman

© 2012 WIPRO LTD | WWW.WIPRO.COM | INTERNAL

Agenda
Similar Entity Detection Scenarios

Challenges, Techniques and Algorithms

Semantic applicability

Big Data Challenges and Solution

Sample Results


Scenario 1 - Fraud Detection in Insurance Claims

Are P1, P2 and P3 the same?
Is there FRAUD?

P1 - Tom Harold makes a claim on Policy 1
P2 - Tom H makes a claim on Policy 2
P3 - T Harold makes a claim on Policy 3


Scenario 2 – Cross Sell Potential Detection in Insurance

Does Tom Harold hold a policy in any other
system? What policies does he hold? Is
there potential for cross-sell?

Tom Harold holds Policy 1 in System 1
He is a high net-worth customer


Example features for different people

Person 1
• First Name - Tom
• Middle Name -
• Last Name - Harold
• Date of Birth - 20/10/1987
• Address - 1, MG Road, Bangalore - 56

Person 2
• First Name - Tom
• Middle Name - Harry
• Last Name - Harold
• Date of Birth - 20/10/1987
• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Person 3
• First Name - Tom
• Last Name - Harold
• Date of Birth - 20/10/1988
• Address - 1, Mahatma Gandhi Rd, Bangalore - 560056

Questions:
• Is Person 1 the same as Person 2?

Similar Entity Detection Challenges

A quick manual inspection of the feature data for Person 1 and Person 2 is enough to conclude that
Person 1 is the same as Person 2

Not so trivial for a machine

Weights must be determined for the different features

Code is needed to decide whether the values of a feature for Person 1 and Person 2 are similar or different
A simple string comparison is not sufficient - is MG Road the same as Mahatma Gandhi Rd?

Actual data will contain spelling mistakes, missing data and wrongly entered data. For example,
20/10/1987 could be entered as 20/10/1988, or the field itself could be empty

Hence other techniques, such as machine learning and semantic techniques, are needed
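
A minimal Python sketch of feature-level comparison, using the MG Road and date-of-birth examples from this slide. It is not the code used in the work presented here: the abbreviation table, the function names and the 0.5 score for missing data are all assumptions made for illustration, and the standard-library difflib ratio stands in for whatever string measure a real system would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"mg": "mahatma gandhi", "rd": "road"}


def normalize_address(text):
    """Lower-case, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def address_similarity(a, b):
    """Similarity in [0, 1] between two addresses after normalization."""
    return SequenceMatcher(None, normalize_address(a), normalize_address(b)).ratio()


def dob_similarity(a, b):
    """Share of matching day/month/year parts for dd/mm/yyyy dates.

    Assumption: a missing value scores 0.5, neither for nor against a match.
    """
    if not a or not b:
        return 0.5
    matches = sum(x == y for x, y in zip(a.split("/"), b.split("/")))
    return matches / 3


addr_score = address_similarity("1, MG Road, Bangalore - 56",
                                "1, Mahatma Gandhi Rd, Bangalore - 560056")
dob_score = dob_similarity("20/10/1987", "20/10/1988")
```

With the abbreviations expanded, the two addresses differ only in the pin code, so addr_score is high even though the raw strings look quite different. The one-digit date slip still matches on two of its three parts.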


Similar Entity Detection Methodology
Given two entities, how can we say that they are the same?

Step 1 • Identify relevant features

Step 2 • Extract values for the features

Step 3 • Create a model which can classify the two entities as same or different

Step 4 • Use the model to classify future customer pairs
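
Steps 1 and 2 can be sketched in Python as follows. The feature list, the field names and the use of difflib as the similarity measure are assumptions chosen for illustration, not the method from this work. The result is one numeric vector per customer pair, which is the form a model in Steps 3 and 4 would consume.

```python
from difflib import SequenceMatcher

# Step 1: the features judged relevant (an illustrative choice).
FEATURES = ["first_name", "middle_name", "last_name", "date_of_birth", "address"]


def field_similarity(a, b):
    """Similarity in [0, 1]; a missing value on either side scores 0.5."""
    if not a or not b:
        return 0.5
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(rec_a, rec_b):
    """Step 2: turn a pair of records into one numeric feature vector."""
    return [field_similarity(rec_a.get(f), rec_b.get(f)) for f in FEATURES]


person_1 = {"first_name": "Tom", "middle_name": "", "last_name": "Harold",
            "date_of_birth": "20/10/1987",
            "address": "1, MG Road, Bangalore - 56"}
person_2 = {"first_name": "Tom", "middle_name": "Harry", "last_name": "Harold",
            "date_of_birth": "20/10/1987",
            "address": "1, Mahatma Gandhi Rd, Bangalore - 560056"}

vector = pair_features(person_1, person_2)
```

Each position in vector corresponds to one entry in FEATURES, so the same code can be reused for any pair of customer records that share these fields.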


Supervised Learning model

Labeled, pre-identified customer pair data is the input

Values for the different features for each of the customer pairs

Each customer pair is tagged as Same, Probably Same or Different

A supervised algorithm is chosen (the actual algorithm depends on the data characteristics)

The tagged data is fed as input

The output is the model

The model will classify a new customer pair into one of the identified categories

Model quality can be measured using precision, recall, accuracy and F-score
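
The measures named above can be computed per category as in the following Python sketch. The six labels are made up purely to exercise the functions and are not results from this work; the function names are illustrative.

```python
def per_class_metrics(actual, predicted, label):
    """Return (precision, recall, F-score) for one category."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


def accuracy(actual, predicted):
    """Fraction of pairs whose predicted category equals the tagged one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up tags for six customer pairs, for illustration only.
actual = ["Same", "Same", "Probably Same", "Different", "Different", "Same"]
predicted = ["Same", "Probably Same", "Probably Same", "Different", "Same", "Same"]

precision, recall, f_score = per_class_metrics(actual, predicted, "Same")
overall_accuracy = accuracy(actual, predicted)
```

Precision penalises pairs wrongly merged into a category, while recall penalises pairs that were missed. In a fraud or cross-sell setting the two errors usually carry different costs, which is why both are worth tracking alongside accuracy.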


Supervised Learning model example

• Live example of how to classify a given set of customer records using
Supervised Learning
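
The live example itself is not reproduced in these slides. As a stand-in, here is a minimal Python sketch of supervised classification using a nearest-centroid classifier. That algorithm is an assumption chosen for brevity, not necessarily the one demonstrated, and the training vectors (name, date-of-birth and address similarity) are made up.

```python
import math


def train_centroids(vectors, labels):
    """Average the feature vectors of each label into one centroid."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(model, vec):
    """Return the label whose centroid is closest to the vector."""
    return min(model, key=lambda label: math.dist(model[label], vec))


# Made-up labeled pairs: [name similarity, DOB similarity, address similarity].
training_vectors = [
    [1.0, 1.0, 0.9], [0.9, 1.0, 1.0],
    [0.9, 0.7, 0.6], [0.8, 0.5, 0.7],
    [0.2, 0.0, 0.1], [0.3, 0.3, 0.2],
]
training_labels = ["Same", "Same", "Probably Same", "Probably Same",
                   "Different", "Different"]

model = train_centroids(training_vectors, training_labels)
prediction = classify(model, [1.0, 0.9, 0.95])
```

The tagged pairs go in, a model comes out, and the model then assigns a new pair to one of the three categories, mirroring the flow described on the previous slide.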


Un-Supervised Learning model

In many cases there is no pre-labeled data

In this case we need to choose an un-supervised
learning model

The model will automatically detect patterns in the
data and cluster the data points into different clusters

Any newly added customer pair would be placed in
the right cluster


Un-Supervised Learning model example

• Live example of how to classify a given set of customer records using
Un-Supervised Learning
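
The live example itself is not reproduced in these slides. As a stand-in, here is a minimal Python sketch of clustering unlabeled customer-pair vectors with plain k-means. The choice of k-means, the fixed starting centroids and the data points are all assumptions made for illustration.

```python
import math


def nearest_cluster(centroids, point):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means from the given starting centroids.

    Each round assigns every point to its nearest centroid, then moves each
    centroid to the mean of its points. Returns (centroids, assignments).
    """
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [nearest_cluster(centroids, p) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Made-up unlabeled pairs: [name similarity, DOB similarity, address similarity].
points = [
    [1.0, 1.0, 0.9], [0.9, 1.0, 1.0], [0.95, 0.9, 1.0],
    [0.2, 0.0, 0.1], [0.3, 0.3, 0.2], [0.1, 0.2, 0.0],
]

centroids, assignments = kmeans(points, [points[0], points[3]])
new_pair_cluster = nearest_cluster(centroids, [0.9, 0.95, 0.9])
```

The clusters carry no names of their own. Someone still has to inspect them and decide that, say, the high-similarity cluster means "same customer".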


Continuous Learning

In many cases there will be a small set
of labeled data and a very large set of
un-labeled data

An initial model will be created using the
small labeled data set

As more labeled data becomes available the model
will evolve through continuous learning and
become better at the classification
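
One simple way to picture a model that evolves as labels arrive is a classifier whose per-category centroids are running means. The Python sketch below is an assumption chosen for illustration, with made-up two-feature vectors; it is not the continuous-learning mechanism used in this work.

```python
import math


class IncrementalCentroidModel:
    """Nearest-centroid classifier that is updated one labeled pair at a time."""

    def __init__(self):
        self.centroids = {}
        self.counts = {}

    def learn(self, vec, label):
        """Fold one newly labeled pair into the running mean for its label."""
        count = self.counts.get(label, 0) + 1
        centroid = self.centroids.get(label, [0.0] * len(vec))
        self.centroids[label] = [c + (v - c) / count for c, v in zip(centroid, vec)]
        self.counts[label] = count

    def classify(self, vec):
        """Return the label whose current centroid is closest to the vector."""
        return min(self.centroids, key=lambda l: math.dist(self.centroids[l], vec))


model = IncrementalCentroidModel()

# Initial model from a very small labeled set.
model.learn([1.0, 1.0], "Same")
model.learn([0.0, 0.0], "Different")
before = model.classify([0.4, 0.4])

# More labeled pairs arrive later and shift the "Same" centroid.
model.learn([0.5, 0.5], "Same")
model.learn([0.4, 0.5], "Same")
after = model.classify([0.4, 0.4])
```

The same borderline pair is classified differently before and after the extra labels, which is the behaviour the slide describes: the model is revised as evidence accumulates rather than being retrained from scratch.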


Semantic Techniques applicability

Semantic similarity scoring for features
• Feature - List of Games played
• Person 1 plays – Racquet Sports
• Person 2 plays – Lawn Tennis
• Using semantic comparison we can see that there is a high similarity between
person 1 and person 2 on the List of Games played feature

Extraction of features from different data sources
• Similar features named differently

Associating customers in different data sources as same or different
• Flexible and easy addition of new relationships

Ease of adding additional data sources
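
The Racquet Sports versus Lawn Tennis comparison can be illustrated with a toy taxonomy in Python. The PARENT table and the scoring formula (one over one plus the number of edges between the two terms) are assumptions made for this sketch. A real system would draw on an ontology or a lexical resource rather than a hand-written dictionary.

```python
# Hypothetical "is a kind of" taxonomy, child -> parent.
PARENT = {
    "Lawn Tennis": "Racquet Sports",
    "Badminton": "Racquet Sports",
    "Racquet Sports": "Sports",
    "Sports": "Games",
    "Chess": "Board Games",
    "Board Games": "Games",
}


def ancestors(term):
    """Return the chain [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def semantic_similarity(a, b):
    """1 / (1 + taxonomy edges between a and b); 0.0 if they share no ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:
            return 1.0 / (1 + steps_a + chain_b.index(node))
    return 0.0


tennis_score = semantic_similarity("Racquet Sports", "Lawn Tennis")
chess_score = semantic_similarity("Chess", "Lawn Tennis")
```

A plain string comparison would see nothing in common between "Racquet Sports" and "Lawn Tennis". The taxonomy places them one step apart, so they score well above an unrelated pair such as Chess and Lawn Tennis.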


Large Data handling challenges

Entity similarity is a pairwise operation

If there are n entities then there are n*(n-1)/2
distinct pairs to compare (n*(n-1) if both orderings of a pair are counted)

Also, within each comparison every
feature pair has to be compared

Highly time consuming operations
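
The growth in work can be made concrete with a short Python sketch. Counting each unordered pair once gives n*(n-1)/2 comparisons, half of the ordered count n*(n-1). The function names are illustrative.

```python
from itertools import combinations


def count_pairs(n):
    """Number of distinct unordered pairs among n entities."""
    return n * (n - 1) // 2


def all_pairs(entities):
    """Every unordered pair, generated explicitly (only feasible for small n)."""
    return list(combinations(entities, 2))


def total_feature_comparisons(n, num_features):
    """Pairs times features: the number of individual value comparisons."""
    return count_pairs(n) * num_features


small = all_pairs(["A", "B", "C", "D"])
million_customer_pairs = count_pairs(1_000_000)
```

Four customers need only six comparisons, but a million customers need roughly half a trillion, each multiplied again by the number of features. That quadratic growth is what makes the operation so time consuming.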


Large Data handling ideas

Use of Apache Mahout
• Split the comparison across m different machines
• Each machine now handles n/m customers
• Nearly an m-times speed-up

Batch time comparisons and tagging
• Reduce run time response to find similar entities

Incremental addition of new customer pairs
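
The splitting and incremental ideas can be sketched in plain Python. This does not use or imitate the Apache Mahout API. It only shows the partitioning logic, and it splits the pairs rather than the customers, which is one possible way to keep the load on each machine even. Function names are illustrative.

```python
from itertools import combinations


def partition_pairs(entities, m):
    """Deal the pairwise comparisons round-robin into m work lists."""
    chunks = [[] for _ in range(m)]
    for i, pair in enumerate(combinations(entities, 2)):
        chunks[i % m].append(pair)
    return chunks


def new_pairs(existing, new_entity):
    """Incremental case: a new entity only needs pairing with existing ones."""
    return [(old, new_entity) for old in existing]


# 10 customers give 45 pairs; three workers receive 15 pairs each.
chunks = partition_pairs(range(10), 3)
incremental = new_pairs(["C1", "C2", "C3"], "C4")
```

Because the work lists are nearly equal in size, m workers finish in roughly 1/m of the time, ignoring coordination overhead. Adding one customer to n existing ones costs only n new comparisons instead of a full re-run.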


Sample Metrics from our experiments

• Discussion on the sample metrics from our experiments
• Learnings from these experiments
– Which algorithm and method was more apt under different
circumstances


Thank You

