Final presentation
"In short, the falling crime rate we've enjoyed may come at a cost: police indifference when you report your stereo was stolen."
From NPR.org, March 30, 2015
Testable Hypotheses
Hypothesis | Potential Attributes
Type of crime | Crime Type (NIBRS raw class, NIBRS category/against)
Location of crime | Lat/Long, distance to high-risk locations (homeless shelter, etc.)
Victim Profile | Age, race, ethnicity, gender
Crime waves | Normalized rolling count of crimes in the last 7 or 30 days (see the sketch following these tables)
Information Provided (Clues) | Witness Present Flag, witness demographics (age, gender)
Time of Crime | Hour of the day
Day/Week of Crime | Day of the week, week of the year
Extreme Weather | Days with snow (e.g. the Feb 2014 snowstorm), days with severe weather
Amount of Damage (Property Crimes only) | Property damage amount, property type

Non-Testable Hypotheses
Hypothesis | Potential Attributes
Police/Department strategy | Not included in the dataset
Police Response | Not included in the dataset
Police Bias | Not included in the dataset
Officer / Department Training | Not included in the dataset
Demographics of Officer | Not included in the dataset
Association of Crimes (Hidden Network) | Not included in the dataset
Institutional Factors (DA Office, etc.) | Not included in the dataset
Other External Factors (e.g. media coverage of a crime) | Difficult to measure and out of scope; would need to append data (e.g. number of media articles per crime)
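To make the crime-wave attribute concrete, here is a minimal sketch of how a normalized rolling 7- and 30-day crime count could be built in R. This is illustrative only; the data frame `crimes` and the column `incident_date` are assumed names, not the project's actual code.

```r
# Minimal sketch of the "crime wave" attribute: normalized rolling counts of
# citywide crimes over the prior 7 and 30 days. The data frame (crimes) and
# column (incident_date) are assumptions, not the project's actual code.
library(dplyr)
library(zoo)

daily_counts <- crimes %>%
  count(incident_date, name = "n_crimes") %>%          # crimes per calendar day
  arrange(incident_date) %>%
  mutate(
    roll_7  = rollsumr(n_crimes, k = 7,  fill = NA),   # trailing 7-day total
    roll_30 = rollsumr(n_crimes, k = 30, fill = NA),   # trailing 30-day total
    # normalize so the two windows are on a comparable scale
    roll_7_norm  = as.numeric(scale(roll_7)),
    roll_30_norm = as.numeric(scale(roll_30))
  )

# Join back to the incident level so each crime record carries the
# crime-wave features for the day it occurred.
crimes_model <- crimes %>%
  left_join(daily_counts, by = "incident_date")
```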
Exclusions
Step in Preparing Model Dataset | Change | Records
Starting Population: Original Dataset | | 261,254
Remove Non-Crimes | -25,992 | 235,262
Remove Unfounded and Misc. Clear Status | -30,593 | 204,669
Remove Non-CLT Crimes (e.g. Matthews) | -1,367 | 203,302
Final Model Dataset | | 203,302
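As an illustration of the exclusion steps above (a sketch only; the column names and status values are assumptions, since the actual filtering lives in the project's R code on GitHub):

```r
# Illustrative sketch of the dataset exclusions; column names (offense_group,
# clearance_status, jurisdiction) and status values are assumed, not taken
# from the actual CMPD extract.
library(dplyr)

model_data <- raw_incidents %>%                      # 261,254 records
  filter(offense_group != "Non-Crime") %>%           # remove non-crimes
  filter(!clearance_status %in%
           c("Unfounded",
             "Open - Cleared, Pending Arrest Validation")) %>%  # misc. clear statuses
  filter(jurisdiction == "Charlotte")                # drop Matthews and other non-CLT agencies

nrow(model_data)                                     # should land near 203,302
```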
Variables by Category
Variable Category | # Fields
Crime Type | 3
Location | 9
Date / Time | 4
Crime Wave | 2
Neighborhood Demographics (QofL) | 10
Police Response | 1
Property | 1
Severe Weather Flag | 2
Victim | 6
Business Victim | 6
Victim/Reporting Flag | 3
Victim-Suspect Relationship | 3
Grand Total | 50
Rank | Variable | Chi-Square
1 | Crime Type I (NIBRS Hi Class) | 0.6247
2 | Crime Type II (Category) | 0.5550
3 | Crime Wave: Rolling 7 Day Avg | 0.4914
4 | Crime against Public | 0.4682
5 | Crime Type III (Against) | 0.4637
6 | Crime against NC State | 0.4443
7 | Victim Age (Binned) | 0.3577
8 | Property Value (Decile) | 0.3041
9 | Place2 (e.g. 30+ location types) | 0.2687
10 | Witness Flag: Provided Address Info | 0.2679
11 | Latitude of Crime | 0.1955
12 | Longitude of Crime | 0.1904
13 | Place1 (e.g. 6 location types) | 0.1889
14 | Victim is White | 0.1687
15 | Crime against Wal-Mart | 0.1622
16 | Victim Knew Suspect Outside of Family | 0.1544
17 | Crime Wave: Rolling 30 Day Avg | 0.1408
18 | Hour of Day of Crime | 0.1370
19 | Victim Knew Suspect Inside of Family | 0.1345
20 | Crime Reported by Officer Flag | 0.1247
[Chart: Clearance rates after exclusions (non-crimes, etc.) applied]
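The speaker notes describe the ranking above as filter-based Chi-Square screening. A minimal sketch of that idea in R is below; `model_data` and `cleared_flag` are assumed names, and the project may have scaled the statistic differently (the values above look normalized).

```r
# Sketch of filter-based Chi-Square screening of each predictor against the
# binary cleared flag. Names are assumed; the project may have normalized the
# statistic before ranking.
predictors <- setdiff(names(model_data), "cleared_flag")

chi_scores <- sapply(predictors, function(v) {
  tab <- table(model_data[[v]], model_data$cleared_flag)
  suppressWarnings(chisq.test(tab)$statistic)        # raw X-squared statistic
})

head(sort(chi_scores, decreasing = TRUE), 20)        # top 20 variables
```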
*Used H2O (via the RStudio interface) for the models.
H2O's website: http://h2o.ai/
Metrics used for model evaluation:
1) Accuracy
2) Area under the Curve (AUC)
"Simple" CART Train Valid Test
Accuracy 0.8033 0.8021 0.7988
AUC 0.8283 0.8290 0.8257
Accuracy | Train | Valid | Test
"Simple" CART | 0.8033 | 0.8021 | 0.7988
CART | 0.8327 | 0.8300 | 0.8276
Naïve Bayes | 0.7495 | 0.7507 | 0.7455
GLM (Regularized) | 0.8257 | 0.8149 | 0.7832
GBM | 0.8808 | 0.8463 | 0.8479
Deep Learning | 0.8573 | 0.8404 | 0.8390
Random Forests | 0.8541 | 0.8402 | 0.8389

AUC | Train | Valid | Test
"Simple" CART | 0.8283 | 0.8290 | 0.8257
CART | 0.8524 | 0.8516 | 0.8480
Naïve Bayes | 0.7951 | 0.7949 | 0.7915
GLM (Regularized) | 0.9157 | 0.9069 | 0.8781
GBM | 0.9528 | 0.9243 | 0.9241
Deep Learning | 0.9346 | 0.9202 | 0.9171
Random Forests | 0.9263 | 0.9154 | 0.9128
Appendix includes Model Tuning Parameters
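For orientation, a hedged sketch of the H2O workflow from R is shown below. Frame and column names (`model_data`, `cleared`) are assumptions; the project's exact calls are in the GitHub repository linked on the next slide.

```r
# Sketch of the H2O train/validate/test workflow from R. Names are assumed;
# see the project's GitHub repo for the actual code.
library(h2o)
h2o.init()

crimes_hex <- as.h2o(model_data)
crimes_hex$cleared <- as.factor(crimes_hex$cleared)       # binary target

splits <- h2o.splitFrame(crimes_hex, ratios = c(0.6, 0.2), seed = 42)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

x <- setdiff(names(crimes_hex), "cleared")

# One of the candidate models (GBM), using the appendix tuning values
gbm_fit <- h2o.gbm(x = x, y = "cleared",
                   training_frame = train, validation_frame = valid,
                   ntrees = 200, max_depth = 5, learn_rate = 0.2)

perf <- h2o.performance(gbm_fit, newdata = test)
h2o.auc(perf)                                             # out-of-sample AUC
h2o.accuracy(perf)                                        # accuracy by threshold
```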
R Code is available on GitHub:
https://github.com/wesslen/MachineLearningProject
1. Crime Occurs
2. Crime Reported
3. Police Collect Info
4. Police Prioritize Crime
5. Solve or Not Solve
Weatherburn, Donald James, and Bronwyn Lind. Delinquent-prone Communities. Cambridge, UK: Cambridge UP, 2001. Print.
"Each increase in the prevalence of involvement in crime expands the scope for further contact between delinquents and susceptibles, thereby fueling further increases in the level of participation in crime."
Red-handed Crimes
Clearance Status | 2012 | 2013 | 2014
Exceptionally Cleared - By Death of Offender | 16 | 23 | 19
Exceptionally Cleared - Cleared by Other Means | 962 | 1,383 | 1,311
Exceptionally Cleared - Extradition Declined | 2 | 2 | 1
Exceptionally Cleared - Located (Missing Persons and Runaways only) | 14 | 13 | 15
Exceptionally Cleared - Prosecution Declined by DA | 173 | 209 | 174
Exceptionally Cleared - Victim Chose not to Prosecute | 6,322 | 5,781 | 5,594
Normal Clearance - Cleared by Arrest | 21,334 | 19,089 | 20,506
Normal Clearance - Cleared by Arrest by Another Agency | 228 | 386 | 330
Open | 46,798 | 45,937 | 47,349
Open - Cleared, Pending Arrest Validation | 65 | 557 | 389
Unfounded | 3,816 | 3,316 | 3,148
Total | 79,730 | 76,696 | 78,836
Total Excluding Rare Clearances (Blue) | 69,094 | 66,409 | 69,166
Clearance Rate (Normal Clearance / Total Excluding Rare) | 32.3% | 30.8% | 31.5%
(Columns show counts by reported year.)
Blue = Excluded from model
Yellow = Event in the Dependent Variable Flag (i.e. equal to 1)
Green = Non-event in the Dependent Variable Flag (i.e. equal to 0)
(Colors refer to row highlighting on the original slide.)
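A minimal sketch of how the dependent variable flag described above could be derived in R, assuming the event is defined by the Normal Clearance statuses (consistent with the clearance-rate row in the table); the project's exact mapping may differ.

```r
# Sketch of the dependent variable: 1 for Normal Clearance statuses, 0 for the
# remaining (non-excluded) records. Data frame and column names are assumed.
library(dplyr)

model_data <- model_data %>%
  mutate(cleared = as.integer(grepl("^Normal Clearance", clearance_status)))

mean(model_data$cleared)   # sanity check: roughly the 31-32% clearance rate above
```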
Model Tuning Parameters
CART (Simple and Normal): Complexity = 0.001, Minimum Split = 1000, Minimum Bucket Size = 1000, Maximum Depth = 5
Naïve Bayes: Laplace Smoother = 3
GLM with Regularization: Alpha = 1 (Lasso)
GBM: Number of Trees = 200, Maximum Depth = 5, Interaction Depth = 2, Learning Rate = 0.2
Deep Learning: 3 Hidden Layers, each with 200 nodes
Random Forests: Number of Trees = 50, Maximum Depth = 10, Minimum Rows = 5, Number of Bins = 20
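For reference, a sketch of how these settings map onto R calls. The CART parameters correspond to `rpart`'s `cp`/`minsplit`/`minbucket`/`maxdepth` arguments, and the rest map onto `h2o` functions; the frames (`train`, `train_df`), predictor list `x`, and target `cleared` are assumed names rather than the project's actual objects.

```r
# Sketch mapping the tuning table to R calls; frame and column names assumed.
library(rpart)
library(h2o)

# CART (simple and normal): complexity, minimum split/bucket, maximum depth
cart_fit <- rpart(cleared ~ ., data = train_df, method = "class",
                  control = rpart.control(cp = 0.001, minsplit = 1000,
                                          minbucket = 1000, maxdepth = 5))

# Naive Bayes with a Laplace smoother of 3
nb_fit <- h2o.naiveBayes(x = x, y = "cleared", training_frame = train, laplace = 3)

# GLM with Lasso regularization (alpha = 1)
glm_fit <- h2o.glm(x = x, y = "cleared", training_frame = train,
                   family = "binomial", alpha = 1)

# Deep learning: three hidden layers of 200 nodes each
dl_fit <- h2o.deeplearning(x = x, y = "cleared", training_frame = train,
                           hidden = c(200, 200, 200))

# Random forest: 50 trees, depth 10, minimum 5 rows per leaf, 20 bins
rf_fit <- h2o.randomForest(x = x, y = "cleared", training_frame = train,
                           ntrees = 50, max_depth = 10, min_rows = 5, nbins = 20)
```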


Editor's Notes

  • #3: The CART and simple CART models performed well too. The CART model performed better than the simpler model, showing that simplicity and interpretability can be traded for increased predictive power. Even better, both models showed little sign of overfitting, as their performance was nearly identical on the training, validation, and test datasets. GLM showed signs of overfitting: its training accuracy was 82.6% while its test accuracy was 78.3%, lower than the simple CART model. More rigorous feature transformation for non-linearities, and perhaps other feature selection techniques (e.g. forward or backward stepwise selection), would likely reduce this overfitting. In conclusion, from a predictive accuracy point of view, GBM was the best model and predicted clearance with nearly 85% (out-of-sample) accuracy. Nevertheless, it remains largely a black-box model whose components are difficult to interpret. Therefore, for practical use, we recommend CART models, which perform quite well and produce interpretable results that practitioners may find more usable than black-box algorithms like GBM and Deep Learning.
  • #4: Why are clearance rates important? They are official metrics tracked by local police departments and the FBI. They measure how effective police are at solving crime and thus, under crime feedback theory, also at preventing it. Lower crime rates don't tell the whole story; use an example of that trade-off.
  • #6: Our approach was to use the software and tools that would work best for the various parts of our project. For the data prep phase we used SQL, OpenRefine, and OpenGIS for the data wrangling. We then used ArcGIS and Tableau to explore the data and look for any high-level patterns. External datasets were found and run through SQL for standardization, then merged with our data using the SAS EG software. Once we had our aggregated dataset we loaded it into RStudio for object building. Lastly, H2O was used for in-memory predictive analytics and fast data mining.
  • #8: Before running our models, we evaluated the importance of each predictor on a filter basis using a statistical approach (Chi-Square). We chose Chi-Square given that nearly all of the variables were categorical and that all variables had originally been screened to ensure they aligned with one of our hypotheses. However, as we explain later, most of our methods (like GBM and GLM with regularization) have their own wrapper-based feature selection algorithms that further refine the list of variables. Notice the crime types are consistent year after year.
  • #9: For classification, we surveyed a range of models, from simple and intuitive (CART) to more complex, black-box models like Gradient Boosting Models and Deep Learning. For the more advanced models, we used the H2O R wrapper to run H2O. H2O is an open-source machine and deep learning suite used to increase scalability for a broad range of algorithms; it uses in-memory compression to run millions of rows of data on a small cluster. We started with a small decision tree built from the features with the largest predictive power. We called this our simple CART, as it was small and easily interpretable. We then gave all of our features to a second decision tree to see if more variables would provide better predictive power. Using the H2O engine, we then ran Naïve Bayes on a limited number of variables with a Laplace smoother (lambda = 3). Fourth, we ran a regularized (Lasso) generalized linear regression. We applied regularization in order to reduce unnecessary and redundant features in the dataset; we selected regularization instead of stepwise selection because only regularization was available in the H2O package, and we chose Lasso (alpha = 1) over Ridge (alpha = 0) because Lasso performed better on the validation dataset. In addition to the traditional methods (GLM, CART, Naïve Bayes), we ran three more advanced, black-box methods: GBM, Deep Learning, and Random Forests. All three have several tuning parameters (e.g. the number of trees and the maximum tree depth for GBM, or the number of hidden neurons for Deep Learning).
  • #10: And here is what our simple decision tree looks like. This simple CART restricted the tree to only the top variables (from filter selection) in order to gain intuition about our dataset. In particular, we restricted the crime type variable to "Against" rather than the more detailed NIBRS_Hi_Class or Category, because this variable had far fewer classes (only four versus 30+), which made interpretation much easier.
  • #14: A number of theories and crime models have been proposed over the years to explain the existence of a positive feedback loop between the level of crime in a neighborhood at one point in time and the level of crime in the same neighborhood at a later point in time. Dr. Weatherburn writes the following in the book Delinquent-prone Communities (Cambridge University Press): "In the epidemic model of crime the positive feedback loop is created by the fact that each increase in the prevalence of involvement in crime expands the scope for further contact between delinquents and susceptibles, thereby fueling further increases in the level of participation in crime."