Kaggle - Airbnb New User Bookings: My Approach
Kaggle Tokyo Meetup #1
2016/03/05
id:@Keiku

Today's agenda
• Overview of the Airbnb New User Bookings competition
  - About the dataset
  - About the metric
• Why I joined this competition
• About the approach
  - Preprocessing
  - Stacked generalization
  - Modeling
  - Results
• About the shake-up
• Closing remarks

About the dataset (1)
• train_users.csv - the training set of users
• test_users.csv - the test set of users
  - id: user id
  - date_account_created: the date of account creation
  - timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
  - date_first_booking: date of first booking
  - gender
  - age
  - signup_method
  - signup_flow: the page a user came to sign up from
  - language: international language preference
  - affiliate_channel: what kind of paid marketing
  - affiliate_provider: where the marketing is, e.g. google, craigslist, other
  - first_affiliate_tracked: the first marketing the user interacted with before signing up
  - signup_app
  - first_device_type
  - first_browser
  - country_destination: this is the target variable you are to predict

About the dataset (2)
• sessions.csv - web sessions log for users
  - user_id: to be joined with the column 'id' in users table
  - action
  - action_type
  - action_detail
  - device_type
  - secs_elapsed
• countries.csv - summary statistics of destination countries in this dataset and their locations
• age_gender_bkts.csv - summary statistics of users' age group, gender, country of destination
• sample_submission.csv - correct format for submitting your predictions

About the metric (1)
• The evaluation metric for this competition is NDCG (Normalized discounted cumulative gain) @k where k=5. NDCG is calculated as:
    DCG_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
    nDCG_k = \frac{DCG_k}{IDCG_k}
• where rel_i is the relevance of the result at position i.
• IDCG_k is the maximum possible (ideal) DCG for a given set of queries. All NDCG calculations are relative values on the interval 0.0 to 1.0.
• For each new user, you are to make a maximum of 5 predictions on the country of the first booking. The ground truth country is marked with relevance = 1, while the rest have relevance = 0.
• For example, if for a particular user the destination is FR, then the predictions become:
    [FR] gives NDCG = (2^1 - 1) / log_2(1 + 1) = 1.0
    [US, FR] gives DCG = (2^0 - 1) / log_2(1 + 1) + (2^1 - 1) / log_2(2 + 1) = 1 / 1.58496 = 0.6309
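The metric can be sketched in a few lines of Python. This is a minimal illustration, not the competition's scoring code: the function names `dcg_at_k` and `ndcg_at_k` are made up for this example, and it assumes each user has exactly one true destination and that the predicted list contains no duplicate countries.

```python
import math


def dcg_at_k(relevances, k=5):
    """DCG over the first k relevance values (positions are 1-based)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(predicted, truth, k=5):
    """NDCG@k for one user with a single true destination (assumption)."""
    relevances = [1 if country == truth else 0 for country in predicted[:k]]
    ideal = dcg_at_k([1], k)  # ideal ranking puts the truth first, so IDCG = 1.0
    return dcg_at_k(relevances, k) / ideal


print(ndcg_at_k(["FR"], "FR"))                   # 1.0
print(round(ndcg_at_k(["US", "FR"], "FR"), 4))   # 0.6309
```

Because only one country is relevant, the score depends only on the rank of the true destination among the five predictions.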

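Why I joined this competition
• Main reasons
  - I wanted to work on a learning-to-rank problem (the metric is NDCG)
    • A past example is Personalize Expedia Hotel Searches - ICDM 2013
  - The train dataset covers 2010/01 to 2014/06 and the test dataset covers 2014/07 to 2014/09
    • I am not confident about cross validation on this type of data
    • Past competitions:
      - Rossmann Store Sales
      - Recruit - Coupon Purchase Prediction
      - Avazu - Click-Through Rate Prediction
  - There is plenty of demographic data, so features are easy to build
• Situation at the time
  - The competition ran from 2015/11/25 to 2016/02/11 (78 days); my first submission was on 2016/01/25, in the final stage
  - I joined for the remaining 3 weeks, as a learning exercise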

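Preprocessing (1)
• Feature extraction
  - Fix the birth dates that appear in the age column
  - Compute the lag between date_first_booking and date_account_created and classify it into 4 categories
  - Compute the lag between date_first_booking and timestamp_first_active and classify it into 3 categories
  - One-hot encode the categorical variables
  - Join age_gender_bkts.csv to train_users.csv and test_users.csv
  - Join countries.csv to train_users.csv and test_users.csv
  - Summarize sessions.csv by (user_id, action), aggregating secs_elapsed and row counts, and join the result to train_users.csv and test_users.csv (likewise for the variables other than action)
• Principles for feature extraction
  - Use everything that can be used
  - I also examined the ordering of rows in sessions.csv, but it had no effect. There are cases, such as the Telstra Network Disruptions data, where the original row order turned out to be a magic feature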
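Two of the feature-extraction steps, bucketing the booking lag and summarizing sessions per (user_id, action), might look roughly like the sketch below. The slides do not say which categories were used, so the four buckets in `lag_category` are purely an assumption. The helper names and the `action_<name>_count` / `action_<name>_secs` column naming are also hypothetical.

```python
from collections import defaultdict
from datetime import date


def lag_category(first_booking, account_created):
    """Bucket the booking lag; these four buckets are an assumption."""
    if first_booking is None:
        return "no_booking"
    lag_days = (first_booking - account_created).days
    if lag_days < 0:
        return "before_signup"
    if lag_days == 0:
        return "same_day"
    return "later"


def summarize_sessions(rows):
    """Aggregate (user_id, action, secs_elapsed) rows per (user_id, action)."""
    summary = defaultdict(lambda: {"count": 0, "secs": 0.0})
    for user_id, action, secs_elapsed in rows:
        cell = summary[(user_id, action)]
        cell["count"] += 1
        cell["secs"] += secs_elapsed or 0.0  # treat missing secs_elapsed as 0
    return summary


def to_wide(summary):
    """Pivot to one feature dict per user, ready to join on the users' id."""
    features = defaultdict(dict)
    for (user_id, action), cell in summary.items():
        features[user_id][f"action_{action}_count"] = cell["count"]
        features[user_id][f"action_{action}_secs"] = cell["secs"]
    return dict(features)


rows = [("u1", "search", 10.0), ("u1", "search", None),
        ("u1", "show", 5.0), ("u2", "search", 2.0)]
features = to_wide(summarize_sessions(rows))
print(features["u1"])
print(lag_category(date(2014, 1, 3), date(2014, 1, 1)))  # later
```

The same aggregation can be repeated with action_type, action_detail, or device_type as the second key, matching the note that the other session variables were handled likewise.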

Preprocessing (2)
• The R package {DescTools} is handy
• Desc() shows all the basic summary statistics

Preprocessing (3)
• The R package {DescTools} is handy
• Desc() shows all the basic summary statistics

Stacked generalization
• Stacking over the following 18 models
1. Model: XGBoost / Target: age / Train dataset: rows with non-missing age
2. Model: XGBoost / Target: age_cln / Train dataset: rows with non-missing age
3. Model: XGBoost / Target: age_cln2 / Train dataset: rows with non-missing age
4. Model: glmnet / Target: age_cln / Train dataset: rows with non-missing age
5. Model: glmnet / Target: age_cln2 / Train dataset: rows with non-missing age
6. Model: XGBoost / Target: country_destination / Train dataset: entire train period
7. Model: XGBoost / Target: country_destination / Train dataset: most recent 12 months
8. Model: XGBoost / Target: country_destination / Train dataset: most recent 6 months
9. Model: XGBoost / Target: country_destination / Train dataset: July, August, and September of each year
10. Model: XGBoost / Target: distance_km / Train dataset: rows with non-missing distance_km
11. Model: XGBoost / Target: destination_km2 / Train dataset: rows with non-missing destination_km2
12. Model: XGBoost / Target: gender / Train dataset: rows where gender is not -unknown-
13. Model: XGBoost / Target: dfb_dac_lag_flg / Train dataset: entire train period
14. Model: XGBoost / Target: dfb_tfa_lag_flg / Train dataset: entire train period
15. Model: XGBoost / Target: dfb_dac_lag / Train dataset: entire train period
16. Model: XGBoost / Target: dfb_tfa_lag / Train dataset: entire train period
17. Model: glmnet / Target: dfb_dac_lag / Train dataset: entire train period
18. Model: glmnet / Target: dfb_tfa_lag / Train dataset: entire train period
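The core mechanic of stacking is to feed out-of-fold predictions from first-level models into the final model as extra features. The sketch below shows that idea only. The slides do not describe the fold scheme, so the shuffled 5-fold split is an assumption, and the group-mean model is a toy stand-in for XGBoost or glmnet. All function names are hypothetical.

```python
import random


def out_of_fold_predictions(X, y, fit, predict, n_folds=5, seed=0):
    """Predict every training row with a model that never saw that row."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    oof = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof


def fit_group_mean(X, y):
    """Toy first-level model: mean target per value of the first column."""
    sums, counts = {}, {}
    for row, target in zip(X, y):
        sums[row[0]] = sums.get(row[0], 0.0) + target
        counts[row[0]] = counts.get(row[0], 0) + 1
    means = {key: sums[key] / counts[key] for key in sums}
    return {"means": means, "overall": sum(y) / len(y)}


def predict_group_mean(model, row):
    """Fall back to the overall mean for values unseen during fitting."""
    return model["means"].get(row[0], model["overall"])


# Ten toy users with a signup_app-like feature and a known age.
X = [["web"], ["ios"]] * 5
age = [30, 40, 32, 42, 28, 38, 30, 40, 30, 40]
age_pred = out_of_fold_predictions(X, age, fit_group_mean, predict_group_mean)

# The first-level prediction becomes a new column for the second-level model.
X_stacked = [row + [pred] for row, pred in zip(X, age_pred)]
print(X_stacked[0])
```

For targets such as age that are missing for many users, the first-level model would be fitted on the non-missing rows and then also applied to the remaining rows and to the test set; that step is omitted here.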

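Modeling (1)
• Modeling with XGBoost
  - eval_metric is NDCG@5
    • merror and mlogloss were not used in the end
    • On a c4.8xlarge, one round of CV takes about 1 minute. Tedious, but bearable
  - Consumed about 10 cups of drip coffee :-)
  - Techniques (Tricks) for Data Mining Competitions (@smly)
    • Tuning via BO, RSCV, and the like was low priority
• Feature selection
  - Raw age in particular hurt accuracy
    • Feature selection improved accuracy in one jump
  - Built models on a random selection of 90% of the features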
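Random feature selection at a 90% rate can be sketched as below. The slides only state that models were built on a random 90% of the features. How many subsets were tried and how they were compared is not stated, so the loop over three seeds is an assumption and `sample_feature_subset` is a hypothetical helper.

```python
import random


def sample_feature_subset(feature_names, fraction=0.9, seed=None):
    """Pick a random subset of columns, keeping the original column order."""
    n_keep = max(1, round(len(feature_names) * fraction))
    chosen = set(random.Random(seed).sample(list(feature_names), n_keep))
    return [name for name in feature_names if name in chosen]


features = [f"f{i}" for i in range(20)]
subsets = [sample_feature_subset(features, 0.9, seed=s) for s in range(3)]
for subset in subsets:
    dropped = sorted(set(features) - set(subset))
    print(len(subset), "kept; dropped:", dropped)
```

Each subset would then be used to train a model and scored with the same cross validation, keeping the subsets whose score improves.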

Modeling (2)
• XGBoost variable importance
  - country_destination (most recent 12 months)
  - dfb_dac_lag_flg (XGBoost)
  - country_destination (most recent 6 months)
  - country_destination (July, August, and September of each year)
  - age_cln2 (XGBoost)

Results (1)
• Score summary
  - In the end I selected submission12 (5-fold CV) and submission16 (Last 6 weeks)

Submission           Memo                                         5 fold-CV  Public   Private  Public Rank  Private Rank
submission01.csv.7z  Trial and error with merror, mlogloss, etc.  -          0.87958  0.88419  -            -
submission02.csv.7z  -                                            -          0.87848  0.88201  -            -
submission07.csv.7z  No stacking                                  0.83265    0.88013  0.88590  152          55
submission08.csv.7z  With stacking                                0.83318    0.88123  0.88645  36           12
submission09.csv.7z  Feature Selection(1)                         0.83355    0.88162  0.88705  12           1
submission14.csv.7z  Last 6 weeks(1)                              0.83319    0.88167  0.88696  12           2
submission15.csv.7z  Bagging of 12                                0.83371    0.88207  0.88688  2            2
submission16.csv.7z  Last 6 weeks(2)                              0.83346    0.88195  0.88678  2            2

Results (2)
• Score progression
[Figure: NDCG@5 Score. LB score (Public and Private series, y-axis from 0.87900 to 0.88800) plotted against local 5 fold-CV score (x-axis from 0.83240 to 0.83380).]

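About the shake-up
• Shake-up was enough of a concern that a topic titled "Expected Leaderboard Shakeup" was started on the forum
• My analysis
  - The 5-fold CV and Public LB scores were strongly correlated; in hindsight, it would have been enough to pick the model that scored well on both
  - Two final submissions can be selected, so I chose the model with the highest Public LB score and the model with the highest Last 6 weeks validation score
    • Best Public LB: Public: 0.88209 (2nd) / Private: 0.88682 (2nd)
    • Best Last 6 weeks Validation: Public: 0.88195 (2nd) / Private: 0.88678 (2nd)
  - A comment from Gilberto Titericz Junior, who seems to handle shake-ups well
    • He had two submissions with the same CV score and chose the one with the better Public LB, but the other one had a better score. Public: 0.88107 (57th) / Private: 0.88675 (3rd)
  - I aimed to keep the model building simple. Ensembling had little effect
  - The features I built were strong, so I stayed near the top despite the shake-up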
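A time-based holdout such as the "Last 6 weeks" validation can be sketched as follows. The slides do not say which date column defined the split, so using date_account_created is an assumption, and `last_weeks_split` is a hypothetical helper.

```python
from datetime import date, timedelta


def last_weeks_split(rows, date_key="date_account_created", weeks=6):
    """Hold out the rows from the final `weeks` weeks of the training period."""
    cutoff = max(row[date_key] for row in rows) - timedelta(weeks=weeks)
    train = [row for row in rows if row[date_key] <= cutoff]
    valid = [row for row in rows if row[date_key] > cutoff]
    return train, valid


# Toy data: one user per day from 2014-01-01 to 2014-06-30.
rows = [{"id": i, "date_account_created": date(2014, 1, 1) + timedelta(days=i)}
        for i in range(181)]
train, valid = last_weeks_split(rows)
print(len(train), len(valid))  # 139 42
```

Since the test set follows the train period in time, a holdout taken from the end of the train period mimics that ordering more closely than a shuffled K-fold split, which is why it is useful as a second validation pattern alongside 5-fold CV.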

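Closing remarks
• Competition retrospective
  - Approach competitions with a stance of always learning, and do not worry much about when you start
  - Look at the data in detail and think it through
    • Consider whether date_first_booking, which exists only in Train, can also be used
  - Match the evaluation metric, even if it is tedious
  - As a rule, select the model with the better cross validation result
  - When a shake-up is a concern, keep the model building simple and prepare different validation patterns in advance
  - Keep notes so that a graph like the one in Results (2) can always be drawn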
