Prospecting and Mining
  MLB (Sports) Data
           Ryan Elmore
       rtelmore@gmail.com
         Twitter: rtelmore

          August 6, 2011
   Rocky Mountain SABR Meeting
Data Science
... And Hype
The Economist, May 14-20, 2011: "Corporate chefs are in demand again, office rents are
soaring and the pay being offered to talented folk in fashionable fields like data
science is reaching Hollywood levels."
Not Your Typical Tech Talk
A Delayed Flight at DIA ...
Why Are The Games Boring?
Simmons Red Sox Data
A Few Thoughts ...
 The underlying data unit is in minutes. So?
 Why is he only looking at this particular set
  of years?
 How do the Red Sox compare to the other
  teams in MLB?
 Crap, that last point will require
  downloading a lot of data ... and my flight
  was boarding in 10 minutes!
Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
 Just step through all of the teams (COL, BOS, etc.)
  and any years that you are interested in.
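Stepping through teams and years amounts to filling two slots in the URL above. A minimal Python sketch, assuming the URL pattern shown on the slide still holds; the function name, the team list, and the year range are illustrative choices, not the code used for the talk:

```python
from itertools import product

# URL pattern copied from the slide; {team} and {year} are the two slots to vary.
BASE = "http://www.baseball-reference.com/teams/{team}/{year}-schedule-scores.shtml"


def schedule_urls(teams, years):
    """Return one schedule-and-scores URL per (team, year) pair."""
    return [BASE.format(team=team, year=year) for team, year in product(teams, years)]


# Illustrative subset: two teams, two seasons.
urls = schedule_urls(["COL", "BOS"], range(2009, 2011))
for url in urls:
    print(url)
```

Extending the team list to all 30 clubs and widening the year range gives the full set of pages to download.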
A Typical Website
 The Golden Nugget
Data All Teams
Another Visual Representation
Boston vs. The Rest
Are The Games Getting Longer?
  I don't know!
  I would say that the evidence supports an
   increase up until 2000 and then it's been
   constant or slightly decreasing.
  This is not an exercise in statistical
   inference; I was just mining the data and
   looking for trends.
  thelogcabin.wordpress.com/
  github.com/rtelmore/MLB
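Looking for a trend of this kind comes down to averaging game durations by season. A rough Python sketch, assuming each game has already been scraped into a (year, "H:MM") pair; that record layout and both function names are assumptions made for illustration, not the format used in the repository above:

```python
from collections import defaultdict


def to_minutes(duration):
    """Convert a game time written as 'H:MM' into a whole number of minutes."""
    hours, minutes = duration.split(":")
    return int(hours) * 60 + int(minutes)


def mean_minutes_by_year(records):
    """Average game length per season.

    records: iterable of (year, 'H:MM') pairs (hypothetical layout).
    Returns a dict mapping each year to its mean game length in minutes.
    """
    by_year = defaultdict(list)
    for year, duration in records:
        by_year[year].append(to_minutes(duration))
    return {year: sum(times) / len(times) for year, times in sorted(by_year.items())}
```

Plotting the resulting yearly means is enough to eyeball a trend; it says nothing about statistical significance.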
Another Exercise

   In a conversation with Paul Parker, he asked
      if the minimum number of pitches per (full)
      inning (6 pitches) has ever been attained.
   This is a hard problem!
   Where do you find this sort of data?
   Back to baseball-reference.com ... the box
      scores.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
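Once pitch counts have been pulled out of a box score, the check itself is simple. A hedged Python sketch, assuming the counts are available as one (top, bottom) pair per inning; that structure and the function name are hypothetical, and extracting the counts from the page is the hard part this sketch leaves out:

```python
def six_pitch_innings(half_inning_pitches):
    """Find full innings completed in exactly 6 pitches.

    half_inning_pitches: list of (top, bottom) pitch counts, one pair per inning.
    Use None for a half inning that was not played (hypothetical convention).
    Returns the 1-based inning numbers that qualify.
    """
    return [
        inning
        for inning, (top, bottom) in enumerate(half_inning_pitches, start=1)
        if top is not None and bottom is not None and top + bottom == 6
    ]
```

An empty list for every game in the sample would mean the minimum was never attained.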
The Boxscore Website
 The Golden Nugget
How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
 30 Teams
 2005 - 2010
 Every day from Apr 1 through Oct 31
 This is a little more than 78K URLs!
 My program took about 3 hrs 25 min.
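The breakdown above can be sketched as a nested loop over teams, years, and calendar days. This Python version assumes the box score URL pattern shown earlier; treating the final digit as a game-number suffix is my assumption, as are the function and parameter names. The total it produces depends on how many suffixes are tried, so it need not match the count on the slide.

```python
from datetime import date, timedelta

# Pattern inferred from the example URL .../boxes/COL/COL201104010.shtml
BOX = "http://www.baseball-reference.com/boxes/{team}/{team}{day:%Y%m%d}{game}.shtml"


def boxscore_urls(teams, years, game_suffixes=("0",)):
    """Build candidate box score URLs for every day from Apr 1 through Oct 31.

    game_suffixes: trailing digits to try for each date (assumed meaning:
    a game number). Many candidates will not exist, e.g. off days.
    """
    urls = []
    for team in teams:
        for year in years:
            day = date(year, 4, 1)
            while day <= date(year, 10, 31):
                for game in game_suffixes:
                    urls.append(BOX.format(team=team, day=day, game=game))
                day += timedelta(days=1)
    return urls
```

Because the team code in the URL is the home team, looping over all 30 teams covers every game once.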
Was That Minimum Attained?

   NO! Unless there is an error in my code.
   Did we learn something? Of course.
   Example: I should've stored everything in a
      database while I was downloading and
      processing the data. Why? I didn't save any
      of the data from the 3+ hrs of computing.
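One way to act on that lesson is to write each page to a local database as soon as it is downloaded. A minimal sketch using Python's built-in sqlite3 module; the table layout, file name, and function names are illustrative assumptions, and the download step itself is left out:

```python
import sqlite3


def open_store(path="boxscores.db"):
    """Open (or create) a SQLite file with one row per downloaded page."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, html):
    """Store the raw page immediately, so a crash does not lose earlier work."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, html)
        )


def is_saved(conn, url):
    """Return True if the page is already stored, so a rerun can skip it."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    return row is not None
```

Checking is_saved before each request lets an interrupted run resume, and keeping the raw HTML means the parsing code can be fixed and rerun without downloading again.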

http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
Using Google Trends Data
 http://www.google.com/trends
 You can put a search term in and it will
  return a lot of historical statistics related to
  your query (e.g., flu trends)
 There is an R package (RGoogleTrends)
  that allows access to the GT API if you have
  a Google account (e.g., Gmail).
 Use the getGTrends(query) function
GT Colorado Rockies
GTrends Colorado Rockies
 Plot annotations: World Series, Playoff Run
Google Trends NFL
Conclusions/Discussion

 There is a lot of data available on the web!
 You can access this data from a browser;
  however, you can access A LOT more data
  if you let your computer do the work.
 Good tools for data mining: R, Python,
  Perl, etc.
 Download data and see where you go
Resources
 thelogcabin.wordpress.com
 github.com/rtelmore
 baseball-reference.com, espn.com, Google
  Trends, etc.
 Twitter (@rtelmore)
 www.r-project.org
 www.meetup.com/DenverRUG
