This document discusses prospecting and mining MLB (baseball) data from various sources on the internet. It describes analyzing Red Sox game data from a particular dataset, how to access play-by-play box score data from Baseball Reference to analyze pitching statistics, and using Google Trends data to examine search interest in topics like the Colorado Rockies and NFL over time. The document advocates downloading data programmatically using tools like R rather than just browsing to allow accessing larger datasets and performing more in-depth analyses.
Sabr
1. Prospecting and Mining
MLB (Sports) Data
Ryan Elmore
rtelmore@gmail.com
Twitter: rtelmore
August 6, 2011
Rocky Mountain SABR Meeting
3. ... And Hype
The Economist, May 14-20, 2011: Corporate chefs are in demand again, office
rents are soaring and the pay being offered to talented folk in fashionable
fields like data science is reaching Hollywood levels.
9. A Few Thoughts ...
The underlying data unit is in minutes. So?
Why is he only looking at this particular set
of years?
How do the Red Sox compare to the other
teams in MLB?
Crap, that last point will require
downloading a lot of data ... and my flight
was boarding in 10 minutes!
11. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
12. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
Just step through
all of the teams:
COL, BOS, etc.
13. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
Just step through all of the teams (COL, BOS, etc.)
and any years that you are interested in.
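Stepping through teams and years can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the talk: the URL template is copied from the slide, while the function name `schedule_urls` and the short team list are assumptions made for the example.

```python
from itertools import product

# Template taken from the schedule URL shown on the slide.
BASE = "http://www.baseball-reference.com/teams/{team}/{year}-schedule-scores.shtml"


def schedule_urls(teams, years):
    """Return one schedule-and-scores URL per (team, year) pair."""
    return [BASE.format(team=t, year=y) for t, y in product(teams, years)]


if __name__ == "__main__":
    # Only the two team codes named on the slide; extend the list as needed.
    for url in schedule_urls(["COL", "BOS"], range(2009, 2011)):
        print(url)
```

Each URL can then be handed to whatever download and parsing step you prefer.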
18. Are The Games Getting Longer?
I don't know!
I would say that the evidence supports an
increase up until 2000, and since then it's been
constant or slightly decreasing.
This is not an exercise in statistical
inference; I was just mining the data and
looking for trends.
thelogcabin.wordpress.com/
github.com/rtelmore/MLB
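Looking for a trend like this amounts to summarizing game length by season. A minimal sketch, assuming the downloaded data has already been reduced to (year, minutes) pairs; the function name `mean_minutes_by_year` and the sample numbers are made up for illustration and are not results from the talk.

```python
from collections import defaultdict
from statistics import mean


def mean_minutes_by_year(games):
    """Average game duration in minutes for each season.

    `games` is an iterable of (year, minutes) pairs.
    """
    by_year = defaultdict(list)
    for year, minutes in games:
        by_year[year].append(minutes)
    return {year: mean(values) for year, values in sorted(by_year.items())}


if __name__ == "__main__":
    # Invented durations, only to show the shape of the input and output.
    sample = [(2000, 180), (2000, 200), (2010, 170)]
    for year, avg in mean_minutes_by_year(sample).items():
        print(year, avg)
```

Plotting these yearly averages is a quick way to eyeball a trend, with the same caveat as above: it is exploration, not inference.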
19. Another Exercise
In a conversation with Paul Parker, he asked
if the minimum number of pitches per (full)
inning (6 pitches) has ever been attained.
This is a hard problem!
Where do you find this sort of data?
Back to baseball-reference.com ... the box
scores.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
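The figure of 6 comes from three outs in each half of the inning, with each out recorded on a single pitch. Once pitch counts have been extracted from the box scores, the check itself is simple. The sketch below assumes a hypothetical record layout of (game_id, inning, half, pitches); the parsing that would produce such records is not shown, and the function names are illustrative only.

```python
from collections import defaultdict


def full_inning_totals(half_innings):
    """Map (game_id, inning) to total pitches, keeping only full innings.

    `half_innings` is an iterable of (game_id, inning, half, pitches)
    tuples, where `half` is "top" or "bottom" (an assumed layout).
    """
    totals = defaultdict(int)
    halves = defaultdict(set)
    for game_id, inning, half, pitches in half_innings:
        totals[(game_id, inning)] += pitches
        halves[(game_id, inning)].add(half)
    # An inning counts as full only if both halves were played.
    return {k: v for k, v in totals.items() if halves[k] == {"top", "bottom"}}


def fewest_pitches(half_innings):
    """Smallest pitch total over all full innings, or None if there are none."""
    totals = full_inning_totals(half_innings)
    return min(totals.values()) if totals else None
```

The minimum is attained if `fewest_pitches(records)` returns 6.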
21. How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
30 Teams
2005 - 2010
Every day from Apr 1 through Oct 31
This is a little more than 78K URLs!
My program took about 3 hrs 25 min.
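The breakdown above can be sketched as a generator of candidate box-score URLs. The template follows the example URL on the earlier slide; reading the trailing digit as a game index is an assumption, as are the names `season_days` and `box_score_urls`. The slide does not say how its total was reached, so the number of URLs produced here need not match it.

```python
from datetime import date, timedelta

# Pattern inferred from .../boxes/COL/COL201104010.shtml; the final digit
# is assumed to be a game index.
BOX = "http://www.baseball-reference.com/boxes/{team}/{team}{day:%Y%m%d}{game}.shtml"


def season_days(year):
    """Yield every calendar day from Apr 1 through Oct 31 of `year`."""
    day = date(year, 4, 1)
    while day <= date(year, 10, 31):
        yield day
        day += timedelta(days=1)


def box_score_urls(teams, years, game=0):
    """Yield one candidate box-score URL per team, year, and day."""
    for team in teams:
        for year in years:
            for day in season_days(year):
                yield BOX.format(team=team, day=day, game=game)


if __name__ == "__main__":
    # Two team codes from the slides; the full list of 30 is left to the reader.
    urls = list(box_score_urls(["COL", "BOS"], range(2005, 2011)))
    print(len(urls), urls[0])
```

Many of these candidates will not correspond to a game, since no team plays every day, so the download step needs to tolerate missing pages.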
22. Was That Minimum Attained?
NO! Unless there is an error in my code.
Did we learn something? Of course.
Example: I should've stored everything in a
database while I was downloading and
processing the data. Why? I didn't save any
of the data from the 3+ hrs of computing.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
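One way to act on that lesson is to write each page to a small database as soon as it is downloaded, so a rerun never repeats the work. A minimal sketch using Python's built-in sqlite3 module; the table layout and the names `open_cache` and `fetch_cached` are assumptions, and `fetch` stands in for whatever download function you use.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a SQLite database with one table of downloaded pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def fetch_cached(conn, url, fetch):
    """Return the page for `url`, calling `fetch(url)` only on a cache miss."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]
    body = fetch(url)
    conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, body))
    conn.commit()  # persist immediately so an interrupted run loses nothing
    return body
```

Passing a file path instead of the default `":memory:"` keeps the pages on disk between runs.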
23. Using Google Trends Data
http://www.google.com/trends
You can put a search term in and it will
return a lot of historical statistics related to
your query (e.g., flu trends).
There is an R package (RGoogleTrends)
that allows access to the GT API if you have
a Google account (e.g., Gmail).
Use the getGTrends(query) function
30. Conclusions/Discussion
There is a lot of data available on the web!
You can access this data from a browser;
however, you can access A LOT more data
if you let your computer do the work.
Good tools for data mining: R, Python,
Perl, etc.
Download data and see where you go