Expanding on Gender Diversity Report:
NamSor algorithms for classification of names by
race/ethnicity or cultural origin/diasporas
NamSor
2018-01
Gender, race/ethnicity or origin bias in AI?
Algorithms are used to assist human decisions in funnel-based processes, e.g.
- recruitment,
- credit allocation.

AI is used especially in the early stages of the selection process (e.g. resume sourcing or screening): search, scoring, tagging.
Is the algorithm FAIR?
Estimating gender, racial/ethnic bias in algorithms, e.g. recruitment
Two approaches:
1) Use Aequitas, an open source bias audit toolkit developed by the Center for Data Science and Public Policy at the University of Chicago
2) Measure changes in a diversity index (Shannon or Simpson) at each selective step
What taxonomy for diversity analytics? What is race/ethnicity?
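As a rough sketch of the second approach, the snippet below computes the Shannon index and the Gini-Simpson form of the Simpson index (1 minus the sum of squared shares) over the group labels that remain after each funnel step. The funnel data, the step names and the group labels A to D are hypothetical, and the slides do not say which Simpson variant is meant, so treat this as one possible reading rather than the toolkit's actual implementation.

```python
import math
from collections import Counter


def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over the group shares p_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def simpson_index(labels):
    """Gini-Simpson diversity 1 - sum(p_i^2): chance that two random picks differ."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def diversity_by_step(funnel):
    """Return {step: (shannon, simpson)} for each step; steps must be non-empty."""
    return {
        step: (shannon_index(labels), simpson_index(labels))
        for step, labels in funnel.items()
    }


# Hypothetical funnel: inferred group label of every candidate left after each step.
funnel = {
    "applied": ["A"] * 25 + ["B"] * 25 + ["C"] * 25 + ["D"] * 25,
    "screened": ["A"] * 20 + ["B"] * 10 + ["C"] * 5 + ["D"] * 5,
    "interviewed": ["A"] * 9 + ["B"] * 1,
}

for step, (shannon, simpson) in diversity_by_step(funnel).items():
    print(f"{step:12s} shannon={shannon:.3f} simpson={simpson:.3f}")
```

A drop in either index from one step to the next flags a step worth auditing; it does not by itself establish that the step is unfair.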
NamSor sorts Names
Names reflect cultural identity
Since 2012, NamSor data mining software has recognized the linguistic or cultural origin of names in any alphabet / language, using both supervised and unsupervised machine learning (i.e. clustering).
2014: launch of Gender API v1
2018: software rewritten from scratch with standard ML frameworks: 1/ name embedding + neural networks 2/ naïve Bayes classifier
2019: launch of NamSor API v2 with Gender, US Race/Ethnicity, Country/Origin/Diaspora classifiers
Our proud contribution to Gender Reports
- NamSor Gender API (v1) was used independently by both Science-Metrix and Elsevier, in 2015 and 2017
- NamSor Gender API v2 was used for "The Researcher Journey Through a Gender Lens", and we've made specific improvements:
  - Enhanced probability estimates for gender inference
  - Improved support for East Asian names (Chinese, Korean, Japanese)
Gender diversity is just one dimension; there are many others.
An artistic illustration of ethnic diversity /
diversity of origin among COVID-19 scientists
Chinese sea at Ars Electronica 2020 by Dario Rodighiero (Harvard Metalab, https://github.com/rodighiero/COVID-19),
Eveline Wandl-Vogt (Austrian Academy of Science) and Elian Carsenat (NamSor)
NamSor CORE taxonomies
NamSor API* is available and already supports robust, fine-grained taxonomies for
- Gender
- US Race/Ethnicity
- Country/Origin
- Diaspora
- India Subclassification (States and Union Territories, ISO 3166-2:IN)
* NamSor v2.0.16, 2021-10
Gender classification model infers the likely gender, with probability:
Taxonomy: Gender
Classes: Male, Female
Field | Example | Description
id | ref12315 | The input identifier
firstName | John | The input given name / firstName
lastName | Smith | The input family name / surname / lastName
likelyGender | male | The likely gender: male or female
probabilityCalibrated | 0.99 | The calibrated probability: 0.5 is unknown, 1 is sure
genderScale | -0.99 | The scale is -1..0..+1 and is based on the probability (Probability = 0.5 -> Scale = 0; Gender = Male & Probability = 1 -> Scale = -1; Gender = Female & Probability = 1 -> Scale = +1)
score | 41 | A non-calibrated score (use probability instead): score = Math.log(getProbaFirst() / getProbaNotFirst()), capped at 100
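The relation between likelyGender, probabilityCalibrated and genderScale can be sketched as follows. The slide only gives three anchor points (0.5 maps to 0, male at 1 maps to -1, female at 1 maps to +1); the linear interpolation between them is an assumption, and the function name gender_scale is hypothetical. The table's own example pairs 0.99 with -0.99, whereas a linear mapping would give -0.98, so the real API may compute the scale differently.

```python
def gender_scale(likely_gender: str, probability: float) -> float:
    """Map (likely gender, calibrated probability) to a signed scale in [-1, +1].

    ASSUMPTION: linear interpolation between the anchor points given on the
    slide (0.5 -> 0, male & 1 -> -1, female & 1 -> +1). Not the official formula.
    """
    if not 0.5 <= probability <= 1.0:
        raise ValueError("probability of the likely gender must be in [0.5, 1]")
    magnitude = 2.0 * probability - 1.0  # 0.5 -> 0.0, 1.0 -> 1.0
    if likely_gender == "male":
        return -magnitude
    if likely_gender == "female":
        return magnitude
    raise ValueError(f"unexpected gender label: {likely_gender!r}")


print(gender_scale("female", 0.75))
print(gender_scale("male", 1.0))
```

A signed scale is convenient for averaging over a population: a mean near 0 suggests a balanced or uncertain group, while a mean near -1 or +1 suggests a strongly skewed one.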
US Race/Ethnicity classifies names by race/ethnicity according to the US Census taxonomy, along with probabilities.
Taxonomy: US Census Race/Ethnicity
Classes (4, or 6*): W_NL (White), B_NL (Black), HL (Hispano-Latino), A (Asian)
Field | Example | Description
id | ref12315 | The input identifier
firstName | Mary | The input first name / given name
lastName | Cao | The input last name / surname
countryIso2 | US | The country of residence, the host country (e.g. US, CA, NZ, GB)
raceEthnicity | A | The likely 'race'/ethnicity: W_NL (white, non latino), HL (hispano latino), A (asian, non latino), B_NL (black, non latino)
raceEthnicityAlt | W_NL | The best alternative 'race'/ethnicity
raceEthnicitiesTop | A, W_NL, ... | The likely 'race'/ethnicities
probabilityCalibrated | 0.91 | The calibrated probability of having correctly guessed the 'race'/ethnicity as A (Asian)
probabilityCalibratedAlt | 0.95 | The calibrated probability of having correctly guessed the 'race'/ethnicity as either A or W_NL (White Non Latino)
* Add the header X-OPTION-USRACEETHNICITY-TAXONOMY: USRACEETHNICITY-6CLASSES for two additional classes, AI_AN (American Indian or Alaskan Native) and PI (Pacific Islander)
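A minimal sketch of how a client might use these fields. The header name and value come from the footnote above; the helper names taxonomy_option_headers and resolve_label and the two thresholds are hypothetical, and the record is simply the example row from the table, not real API output. No network call is made.

```python
def taxonomy_option_headers(six_classes: bool = False) -> dict:
    """Extra request headers selecting the 4-class (default) or 6-class taxonomy."""
    if six_classes:
        return {"X-OPTION-USRACEETHNICITY-TAXONOMY": "USRACEETHNICITY-6CLASSES"}
    return {}


def resolve_label(record: dict, min_top: float = 0.8, min_top2: float = 0.9) -> set:
    """Return the set of classes we are willing to report for one record.

    ASSUMPTION: the thresholds are illustrative; choose them per use case.
    - top class alone if probabilityCalibrated is high enough,
    - otherwise the top-2 set if probabilityCalibratedAlt is high enough,
    - otherwise an empty set (treat as unknown).
    """
    top, alt = record["raceEthnicity"], record["raceEthnicityAlt"]
    if record["probabilityCalibrated"] >= min_top:
        return {top}
    if record["probabilityCalibratedAlt"] >= min_top2:
        return {top, alt}
    return set()


# Example row from the table above.
example = {
    "raceEthnicity": "A",
    "raceEthnicityAlt": "W_NL",
    "probabilityCalibrated": 0.91,
    "probabilityCalibratedAlt": 0.95,
}
print(taxonomy_option_headers(six_classes=True))
print(resolve_label(example))
```

Keeping an explicit unknown outcome avoids forcing low-confidence names into a class, which would otherwise distort any downstream diversity measure.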
Country classifies names to ~250 countries with valid ISO2 codes, from Ireland (IE) to Spain (ES) or Mexico (MX), including all African and Asian countries.
Taxonomy: Country
Classes: IE, DE, ES, MX, ...
Field | Example | Description
id | ref12315 | The input identifier
name | Jing Cao | The input full name
country | CN | The likely residence country ISO2 code, which CAN include melting-pot countries
countryAlt | TW | The best alternative residence country
region | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
topRegion | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
subRegion | Eastern Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
countriesTop | CN, TW, HK... | The top 10 likely residence country ISO2 codes
probabilityCalibrated | 0.89 | The calibrated probability of having correctly guessed the country of residence (CN)
probabilityCalibratedAlt | 0.92 | The calibrated probability of having correctly guessed the country of residence as either CN or TW
Origin infers the likely country of origin from a name, based on naming patterns among ~130 countries with strong name identity (IE, DE, ES, PT etc.)
Taxonomy: Origin
Classes: IE, DE, ES, PT, ...
Field | Example | Description
id | ref12315 | The input identifier
name | Jing Cao | The input full name
country | CN | The likely residence country ISO2 code, which CAN include melting-pot countries
countryAlt | TW | The best alternative residence country
region | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
topRegion | Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
subRegion | Eastern Asia | An arbitrary grouping of countries by topRegion/Region/subRegion
countriesTop | CN, TW, HK... | The top 10 likely residence country ISO2 codes
probabilityCalibrated | 0.89 | The calibrated probability of having correctly guessed the country of residence (CN)
probabilityCalibratedAlt | 0.92 | The calibrated probability of having correctly guessed the country of residence as either CN or TW
Diaspora infers the likely ethnicity, diaspora or country of origin from a name, given a geographic context (e.g. US, CA, ...), with ~130 ethnicities (Irish, Chinese, etc.)
Taxonomy: Diaspora
Classes: Irish, German, Hispanic, Chinese, ...
Field | Example | Description
id | ref12315 | The input identifier
firstName | Mary | The input first name / given name
lastName | Cao | The input last name / surname
countryIso2 | US | The country of residence, the host country (e.g. US, CA, NZ, GB)
ethnicity | Chinese | The likely ethnicity
ethnicityAlt | Vietnamese | The best alternative ethnicity
ethnicitiesTop | Chinese, Vietnamese, Korean ... | The top 10 likely ethnicities
probabilityCalibrated | 0.84 | The calibrated probability of having correctly guessed the ethnicity as Chinese
probabilityCalibratedAlt | 0.85 | The calibrated probability of having correctly guessed the ethnicity as either Chinese or Vietnamese
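One way such per-name results could feed a population-level estimate is sketched below: records whose calibrated probability falls under a cut-off are counted as "Unknown" rather than assigned to a group. The function name tally_ethnicities, the 0.7 cut-off and the sample records are all hypothetical; only the field names ethnicity and probabilityCalibrated come from the table above.

```python
from collections import Counter


def tally_ethnicities(records, min_probability: float = 0.7) -> Counter:
    """Count inferred ethnicities, sending low-confidence records to 'Unknown'.

    ASSUMPTION: a single global threshold; real audits may tune it per class.
    """
    counts = Counter()
    for record in records:
        if record["probabilityCalibrated"] >= min_probability:
            counts[record["ethnicity"]] += 1
        else:
            counts["Unknown"] += 1
    return counts


# Hypothetical classifier outputs for four names.
records = [
    {"ethnicity": "Chinese", "probabilityCalibrated": 0.84},
    {"ethnicity": "Irish", "probabilityCalibrated": 0.91},
    {"ethnicity": "Chinese", "probabilityCalibrated": 0.75},
    {"ethnicity": "German", "probabilityCalibrated": 0.52},
]
print(tally_ethnicities(records))
```

Reporting the size of the "Unknown" bucket alongside the group counts makes it visible how much of the population the estimate actually covers.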
Subclassification infers the likely state/region (a sub-level of country). Initially this model is calibrated only for India (IN) States or Union Territories (ISO 3166-2:IN). We can expand this model to other countries; let us know.
Taxonomy: Subclassification (India)
Classes: IN-AP (Andhra Pradesh), IN-AR (Arunachal Pradesh), IN-AS (Assam), ...
Field | Example | Description
id | ref12315 | The input identifier
firstName | Bhupen | The input first name / given name
lastName | Borah | The input last name / surname
countryIso2 | IN | The country (initially only IN: India is supported)
subClassification | IN-AR | The likely state/region
subClassificationAlt | IN-ML | The best alternative state/region
subClassificationTop | IN-AR, IN-ML... | The top 10 likely states/regions
probabilityCalibrated | 0.84 | The calibrated probability of having correctly guessed the likely state/region as IN-AR
probabilityCalibratedAlt | 0.85 | The calibrated probability of having correctly guessed the likely state/region as IN-AR or as IN-ML
Limitations to such taxonomies
- Human societies are fractal in their diversity:
  - A coarse-grained classification model may not fit all markets (e.g. African-American/Black vs. White vs. African/Black: how does North-African fit?)
  - A fine-grained classification model may be too fine-grained or controversial in specific regions
  - For example, IN/Indian is one class among 130 classes in our Origin/Diaspora taxonomy, but there are ~30 states in India with many ethnic/clan/caste system sub-groups
- Privacy and self-identification: how can people override the classification?
Liberia - a regional onomastics 'mille-feuille'
Example of complex regional or ethnic identities in Africa: Liberia. This visualization uses an unsupervised name classification algorithm to recognize subgroups in different regions of Liberia.
Thank you !
Elian CARSENAT,
elian.carsenat@namsor.com
Phone : +33 6 52 77 99 07
Try NamSor for yourself at https://namsor.app/

NamSor AI Bias Estimator, Gender, Racial and Ethnic Fairness Toolkit
