The document discusses how Puhui Finance, a Chinese P2P lending company, uses big data and AI techniques for risk control. It introduces their Feature Compute Engine, which converts unstructured user data into structured features, and their Knowledge Graph, which connects entities and analyzes relationships. Specific use cases discussed include anti-fraud detection using rules, contact recovery by building phone networks, and detecting high-risk individuals via search engines. Challenges around unstructured data, name disambiguation, reasoning and lack of training data are also covered.
Big Data and AI in the P2P Industry: Knowledge Graph and Inference
1. Big Data and AI in P2P Industry
Wenzhe Li
nadalwz1115@gmail.com
Feb 1, 2016
4. What the talk is about
In this talk, I will mainly focus on the techniques used in lending-side risk control. Similar techniques can be applied to the financing side.
5. Outline
- Why we need big data and AI
- Intro to FC Engine and Knowledge Graph
- Case 1: Anti-Fraud
- Case 2: Lost Contact Recovery
- Case 3: Detect Bad People via Search
- More use cases
- Challenges
6. Why big data & AI
- The credit system in China is not mature
- We target the under-served market: those who don't have enough credit to borrow from a bank
- Data from credit history alone is not enough to build the scoring models
- A more efficient application review process is needed as we move more transactions from offline to online
9. Measure the risk for a person
- Individual feature analysis: Feature Compute (FC) Engine
- Relation analysis: Knowledge Graph
10. Data
- Data the user enters explicitly (e.g. application form)
- Authorized* user data
  - Mobile history
  - Purchasing history
  - …
- Open search
  - [illegible]
  - 360.com
  - Others (e.g. craigslist)
- Third-party data (e.g. blacklist)
Data type: unstructured data
* User authorizes us to use their data
12. Feature Compute Engine
Diagram: raw data (credit card history, mobile history, purchasing history, ...) feeds the Feature Compute Engine, which fills a Feature Container (tens of thousands of features). Scoring models use these features to produce a risk score, a fraud score, and precision marketing.
13. Example: Purchasing History
- Total amount spent during the last 6 months
- User level (e.g. Prime, Normal, …)
- Total number of transactions during the last 6 months
- Length of time the user has used the account
- Total number of transactions related to virtual products
- Total number of transactions related to luxury products
- …
A few thousand features
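The purchasing-history features listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the FC Engine itself: the record layout (date, amount, category), the function name `purchasing_features`, and the 182-day window standing in for "6 months" are all assumptions.

```python
from datetime import date, timedelta

# Hypothetical transaction records: (purchase date, amount, product category).
transactions = [
    (date(2016, 1, 20), 120.0, "virtual"),
    (date(2015, 12, 5), 3500.0, "luxury"),
    (date(2015, 10, 11), 80.0, "grocery"),
    (date(2015, 3, 2), 60.0, "virtual"),  # older than the 6-month window
]

def purchasing_features(transactions, account_opened, today, window_days=182):
    """Turn raw purchase records into a flat dictionary of features."""
    cutoff = today - timedelta(days=window_days)
    recent = [t for t in transactions if t[0] >= cutoff]
    return {
        "amount_last_6m": sum(amount for _, amount, _ in recent),
        "n_transactions_last_6m": len(recent),
        "account_age_days": (today - account_opened).days,
        # The two category counts below cover the whole history, not only the window.
        "n_virtual": sum(1 for _, _, cat in transactions if cat == "virtual"),
        "n_luxury": sum(1 for _, _, cat in transactions if cat == "luxury"),
    }

features = purchasing_features(
    transactions, account_opened=date(2013, 2, 1), today=date(2016, 2, 1)
)
print(features)
```

Each data source would get its own function of this kind, and the resulting dictionaries can be merged into one feature record per applicant.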
14. What is a knowledge graph
- It is a semantic network
- It is based on a graph data structure that consists of points and edges: a point represents an entity, an edge represents a relationship
- A knowledge graph connects heterogeneous information and makes it possible to analyze data from the perspective of relationships
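As a rough illustration of the point-and-edge structure described here, the sketch below stores typed entities and typed relations in plain Python dictionaries. The class name `KnowledgeGraph`, the identifiers, and the relation names are hypothetical; a production system would more likely sit on a graph database.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Entities are points; typed relations are edges between them."""

    def __init__(self):
        self.entity_type = {}          # entity id -> entity type
        self.edges = defaultdict(set)  # entity id -> {(relation, other entity id)}

    def add_entity(self, entity_id, entity_type):
        self.entity_type[entity_id] = entity_type

    def add_relation(self, source, relation, target):
        # Store the edge in both directions so either endpoint can start a lookup.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def neighbors(self, entity_id, relation=None):
        """Return connected entity ids, optionally restricted to one relation type."""
        return sorted(
            other for rel, other in self.edges[entity_id]
            if relation is None or rel == relation
        )

kg = KnowledgeGraph()
kg.add_entity("applicant:A", "person")
kg.add_entity("phone:555-0100", "phone")
kg.add_entity("company:Puhui Finance", "company")
kg.add_relation("applicant:A", "has_phone", "phone:555-0100")
kg.add_relation("applicant:A", "works_at", "company:Puhui Finance")

print(kg.neighbors("applicant:A"))
print(kg.neighbors("applicant:A", relation="has_phone"))
```

Because people, phones, and companies live in the same structure, a question such as "who else is connected to this phone?" becomes a neighbor lookup.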
26. Domain-specific knowledge graph
- 10 types of entities
- ~50 types of relations
- ~50M entities
- 0.2B relations
We expect it to become ~20 times bigger by the end of this year due to business growth.
28. Antifraud: rules
Rule: the applicant shares the same personal phone with another applicant.
Diagram: Applicant and Other applicant both link to the same Phone node through a "Personal Phone" edge.
29. Antifraud: rules (cont.)
Rule: the applicant and another applicant share the same colleague phone, but with different company names.
Diagram: Applicant (Company 1) and Other applicant (Company 2) both link to the same Phone node through a "Colleague phone" edge.
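The two rules on these slides can be sketched as simple group-and-compare checks. This is a minimal illustration under assumed data: the field names (`personal_phone`, `colleague_phone`, `company`) and the flat record layout are hypothetical, and the talk itself evaluates such rules on the knowledge graph rather than on a list.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical application records.
applications = [
    {"id": "A1", "personal_phone": "555-0101", "colleague_phone": "555-0200", "company": "Company 1"},
    {"id": "A2", "personal_phone": "555-0101", "colleague_phone": "555-0300", "company": "Company 3"},
    {"id": "A3", "personal_phone": "555-0103", "colleague_phone": "555-0200", "company": "Company 2"},
]

def group_by(applications, field):
    """Group applications that share the same value for a field."""
    groups = defaultdict(list)
    for app in applications:
        groups[app[field]].append(app)
    return groups

def shared_personal_phone(applications):
    """Rule 1: two applicants list the same personal phone."""
    hits = []
    for apps in group_by(applications, "personal_phone").values():
        hits.extend((a["id"], b["id"]) for a, b in combinations(apps, 2))
    return hits

def shared_colleague_phone_different_company(applications):
    """Rule 2: same colleague phone, but different company names."""
    hits = []
    for apps in group_by(applications, "colleague_phone").values():
        hits.extend(
            (a["id"], b["id"])
            for a, b in combinations(apps, 2)
            if a["company"] != b["company"]
        )
    return hits

print(shared_personal_phone(applications))
print(shared_colleague_phone_different_company(applications))
```

Each rule returns the pairs of applications that trigger it, which can then be counted or passed on as an input to a scoring model.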
36. Antifraud: fraud score
Variables: features extracted from raw data, results from anti-fraud rules, user direct attributes
Models: LR, Decision Tree, Random Forest, SVM, ANN, DNN
Prediction: score
The score is used to directly reject or accept the loan.
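To make the variables-to-score-to-decision flow concrete, here is a minimal sketch using a logistic regression (LR) scoring function. The variable names, weights, bias, and threshold are invented for illustration; a real model would learn its parameters from labeled applications, and the talk does not say which of the listed models is used in production.

```python
import math

# Hypothetical weights; a real model would learn them from labeled applications.
WEIGHTS = {"n_rule_hits": 1.5, "n_virtual_purchases": 0.2, "years_at_job": -0.4}
BIAS = -2.0

def fraud_score(variables):
    """Logistic regression score in (0, 1); higher means more likely fraud."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in variables.items())
    return 1.0 / (1.0 + math.exp(-z))

def decide(variables, threshold=0.5):
    """Reject the loan when the score reaches the (assumed) threshold."""
    return "reject" if fraud_score(variables) >= threshold else "accept"

low_risk = {"n_rule_hits": 0, "n_virtual_purchases": 1, "years_at_job": 5}
high_risk = {"n_rule_hits": 2, "n_virtual_purchases": 3, "years_at_job": 1}

print(round(fraud_score(low_risk), 3), decide(low_risk))
print(round(fraud_score(high_risk), 3), decide(high_risk))
```

The three inputs mirror the three variable groups on the slide: a rule result, a feature extracted from raw data, and a direct user attribute.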
38. Lost contact recovery: what is it
The borrower disappears, and all the contact information they explicitly provided becomes invalid. How do we reach them?
Implicitly infer potential contact information.
42. Building phone network: Rank
Simple ranking criteria
- Total length of call time
- Frequency of calls
Advanced approach
- Learn the ranking score with a machine learning approach
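The simple ranking criteria can be sketched as an aggregation over a call log followed by a sort. The log layout and the tie-breaking order (total talk time first, then call count) are assumptions; the slide names the two criteria but not how they are combined.

```python
from collections import defaultdict

# Hypothetical call log of the lost borrower: (other number, call duration in seconds).
call_log = [
    ("555-0111", 600), ("555-0111", 300),
    ("555-0122", 30), ("555-0122", 20), ("555-0122", 40), ("555-0122", 10),
    ("555-0133", 45),
]

def rank_contacts(call_log):
    """Rank phone numbers by total talk time, then by number of calls."""
    total_seconds = defaultdict(int)
    n_calls = defaultdict(int)
    for number, seconds in call_log:
        total_seconds[number] += seconds
        n_calls[number] += 1
    # Both criteria sort in descending order.
    return sorted(
        total_seconds,
        key=lambda number: (total_seconds[number], n_calls[number]),
        reverse=True,
    )

ranking = rank_contacts(call_log)
print(ranking)
```

The numbers at the top of the ranking are the first candidates to try when the borrower's own contact details no longer work.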
43. Building phone network: Predict the relation
~100 features, for example:
- Total # of calls made
- Total length of call time
- Total # of calls received
- Average time per call
- Maximum call length
- # of calls between 0-4am
- # of calls between 4-8am
- …
Models: LR, Decision Tree, Random Forest, SVM, ANN, DNN
Prediction: relation
With very limited training data, our model provides ~30% accuracy.
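A few of the call features named on this slide can be computed as follows. This is a sketch under assumed inputs: the record layout (start time, duration, direction) and the feature names are hypothetical, and it covers only a handful of the roughly 100 features mentioned.

```python
from datetime import datetime

# Hypothetical call records between the borrower and one contact:
# (start time, duration in seconds, direction from the borrower's side).
calls = [
    (datetime(2016, 1, 3, 1, 15), 900, "out"),
    (datetime(2016, 1, 5, 6, 40), 120, "in"),
    (datetime(2016, 1, 9, 20, 5), 300, "out"),
    (datetime(2016, 1, 12, 2, 30), 480, "out"),
]

def call_features(calls):
    """Summarize the calls with one contact as a flat feature dictionary."""
    durations = [seconds for _, seconds, _ in calls]
    return {
        "n_calls_made": sum(1 for _, _, d in calls if d == "out"),
        "n_calls_received": sum(1 for _, _, d in calls if d == "in"),
        "total_seconds": sum(durations),
        "avg_seconds": sum(durations) / len(durations) if durations else 0.0,
        "max_seconds": max(durations, default=0),
        "n_calls_0_4am": sum(1 for start, _, _ in calls if 0 <= start.hour < 4),
        "n_calls_4_8am": sum(1 for start, _, _ in calls if 4 <= start.hour < 8),
    }

pair_features = call_features(calls)
print(pair_features)
```

A feature dictionary like this, computed per pair of phone numbers, is the kind of input a relation classifier would be trained on.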
46. Detect Bad People via Search
From the search results, we label each entity in the knowledge graph, e.g. black, green, etc.
47. Search for basic information…
Search fields
- Phone number
- Email
- QQ
- Other IDs
Search engines & public sites
- [illegible]
- 360.com
- Other public websites
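One simple way to turn search results into entity labels is keyword matching on the result snippets. The sketch below is purely illustrative: the keyword list, the two-label scheme, and the rule "no hit means green" are assumptions, since the talk does not describe how labels are derived from the results.

```python
# Hypothetical risk keywords; the talk does not list the ones actually used.
RISK_KEYWORDS = ("fraud", "scam", "loan agent", "debt collection")

def label_entity(snippets):
    """Return "black" if any search-result snippet mentions a risk keyword, else "green"."""
    for snippet in snippets:
        text = snippet.lower()
        if any(keyword in text for keyword in RISK_KEYWORDS):
            return "black"
    return "green"

# Hypothetical search results, keyed by the searched phone number.
search_results = {
    "555-0101": ["Reported as a scam caller by several users"],
    "555-0103": ["Pizza delivery, open until 10pm"],
}

labels = {entity: label_entity(snippets) for entity, snippets in search_results.items()}
print(labels)
```

The resulting labels can be attached to the matching phone, email, or ID entities in the knowledge graph.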
52. Challenges: Unstructured Data
Unstructured data: images, text, audio, video
Techniques: machine learning, natural language processing, data mining
53. Challenges: Name Disambiguation
One applicant's company is recorded as "Puhui Finance Ltd." and another applicant's as "Puhui Finance". They are the same company; can we merge them?
It is a very important problem to deal with!
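For the "Puhui Finance Ltd." versus "Puhui Finance" case, a first-pass heuristic is to normalize names before comparing them. The sketch below is only a baseline under assumptions: the suffix list is hypothetical, and the tokenizer handles Latin-script names only, so Chinese company names would need a different normalization step.

```python
import re

# Hypothetical list of legal-form suffixes to ignore when comparing company names.
SUFFIXES = {"ltd", "limited", "co", "inc", "corp", "company"}

def normalize_company(name):
    """Lowercase, drop punctuation, and strip trailing legal-form suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def same_company(a, b):
    """Treat two names as the same company when their normalized forms match."""
    return normalize_company(a) == normalize_company(b)

print(same_company("Puhui Finance Ltd.", "Puhui Finance"))
print(same_company("Puhui Finance", "Puhui Bank"))
```

Exact matching after normalization misses abbreviations and misspellings, which is part of why the slide calls this an important problem.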
54. Challenges: Reasoning
Link prediction
- Logic-based approach
- Probabilistic approach (e.g. distributed representation)
- Hybrid approach
However, it is still an open problem.
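As one possible instance of link prediction with distributed representations, the sketch below scores a candidate edge by how close head + relation lands to the tail in embedding space (a translation-based score). The vectors are set by hand for illustration and the entity and relation names are hypothetical; in practice the embeddings are learned from the graph, and the talk does not state which model is used.

```python
import math

# Toy 2-dimensional embeddings, set by hand for illustration; real ones are learned.
entity = {
    "applicant_A": (0.0, 0.0),
    "company_X": (1.0, 1.0),
    "company_Y": (3.0, -2.0),
}
relation = {"works_at": (1.0, 1.0)}

def score(head, rel, tail):
    """Negative distance between head + relation and tail; closer to 0 is more plausible."""
    h, r, t = entity[head], relation[rel], entity[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Rank candidate tails for the missing link (applicant_A, works_at, ?).
candidates = ["company_X", "company_Y"]
best = max(candidates, key=lambda c: score("applicant_A", "works_at", c))
print(best)
```

A logic-based approach would derive the same missing edge from explicit rules instead, and a hybrid approach combines the two.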
56. We are hiring! (in Beijing)
Open positions include, but are not limited to:
- Senior/Lead Machine Learning/NLP Engineers
- Senior/Lead Data Engineer/Scientist
- Senior/Lead Architect
- Senior/Lead Software Engineer
Contact
liwenzhe@puhuifinance.com
zhaopin@puhuifinance.com
Company website
www.puhuifinance.com