The document describes a cricket match prediction system that uses both historical and social media data. The system aims to develop a consistent statistical method to predict cricket match results by collecting vital attributes from a large dataset containing historical match statistics and real-time social media information. It analyzes this data using machine learning techniques to provide accurate predictions of cricket match outcomes before matches are played.
3. Why Predict Cricket?
A television audience of 2–3 billion follows cricket.
Cricket is a million-dollar game.
4. AIM & Objective
A consistent statistical method to predict results
Develop a dataset containing vital attributes
Predict the outcome before the match is played
#3: Our system will predict the outcome of a cricket match on the basis of historical data and data from social media.
In this project, different approaches to a new time-series prediction problem, i.e. predicting the outcome of a One-Day International (ODI) cricket match, are presented.
By combining historical data with social media data, we are able to predict the result more accurately.
#4: Today's sports professionals include not only the sportsmen actively participating in the game, but also their coaches, trainers, physiotherapists and, in many cases, strategists.
Players and team management (collectively often referred to as the team think-tank) perform as a human expert system, relying on experience, expertise and analytic ability to arrive at the best possible course of action before as well as during a game.
Vast amounts of raw data and statistics are available to aid the decision-making process, but determining what it takes to win a game is extremely challenging.
#5: The primary aim of this project is to establish a consistent statistical approach to predicting the outcome of a match.
To develop a dataset containing the vital attributes that define the match outcome.
To predict the outcome of a match before it is played.
To help teams focus their preparation for the match according to the prediction.
#7: Data Collection:
Previous match data was scraped from the cricket site cricsheet.org
Data Filtration:
The data from Cricsheet contain ball-by-ball records for every single match. We do not need ball-by-ball detail, only summarized data that give a complete picture of each match, so we perform a data filtration step.
Feature Construction
31 features are formed in a clear hierarchy with three levels (a sketch of the hierarchy follows this list):
Basic Features
Net Features
Difference Features
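To make the hierarchy concrete, here is a minimal Python sketch of the three levels. The feature names and formulas are illustrative assumptions only; the actual 31 features are the ones listed in chapter 3.

```python
# Illustrative sketch of the three feature levels; names and formulas are
# hypothetical, not the project's actual 31 features.

def basic_features(team_rows):
    """Basic features: raw per-team aggregates over past match summaries."""
    n = len(team_rows)
    return {
        "avg_runs_scored": sum(r["runs_scored"] for r in team_rows) / n,
        "avg_wickets_lost": sum(r["wickets_lost"] for r in team_rows) / n,
        "win_ratio": sum(1 for r in team_rows if r["result"] == "won") / n,
    }

def net_features(basic):
    """Net features: values derived from a team's basic features."""
    return {"net_run_strength": basic["avg_runs_scored"] * basic["win_ratio"]}

def difference_features(team_basic, opponent_basic):
    """Difference features: team-minus-opponent versions of basic features."""
    return {k + "_diff": team_basic[k] - opponent_basic[k] for k in team_basic}

rows = [{"runs_scored": 280, "wickets_lost": 6, "result": "won"},
        {"runs_scored": 240, "wickets_lost": 9, "result": "lost"}]
print(difference_features(basic_features(rows), basic_features(rows)))
```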
Training
For training and classification, we use 70% of our data. We train multiple models on this data so that we can improve our prediction.
Testing
We use 30% of our data set to check the accuracy of our results.
Prediction
To make a prediction, we simply provide the team name and the opponent's name; the system computes the required features and returns the expected outcome of the match (see the sketch below). This is explained further in the next section.
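As an illustration of that interface, a minimal sketch; the function, the feature store and the team names are placeholders, not the project's actual code.

```python
# Hypothetical prediction interface: given only the two team names, look up
# their historical feature vector and classify it with a trained model.
def predict_match(team, opponent, model, feature_store):
    """feature_store maps a (team, opponent) pair to its feature vector."""
    features = feature_store[(team, opponent)]
    return model.predict([features])[0]       # e.g. "win" or "loss"

# Usage (all objects are placeholders):
# predict_match("India", "Australia", trained_model, feature_store)
```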
#9: Data Collection:
The first challenge was to collect the right data. We applied multiple queries to fetch a large volume of data from Twitter. Once we had the data, we extracted the selected attributes that helped us in the data filtering step.
Data Filtration:
After obtaining the data, we filtered out tweets with spam content or from spam users. We considered several factors that classify a tweet as spam or ham.
Feature Reduction:
Tweets are generally in sentence format, with URLs pointing to images or blog articles. To get the data into a usable format, we remove stop words, i.e. general terms such as "a", "the", etc., as well as emoticons.
Training
For training and classification, we use 70% of our data. We train multiple models on this data so that we can improve our prediction.
Testing
We use 30% of our data set to check the accuracy of our results.
Prediction
In order to make a prediction, we just have to give the team name and the opponent's name; the system gives the prediction of the match after performing the required computations. This is explained further in the next chapter.
#11: Data Collection:
Data was collected from Cricsheet.
The data is provided in YAML format, a human-readable data format; libraries are available to parse it in multiple languages.
To summarize the data, we used an R package called yorkr.
This R package can be used to analyze the performances of cricketers based on match data from Cricsheet.
Using yorkr, we processed the YAML data to create a database of all matches.
Each match summary record contains: Team, Opponent, Venue, Date, Runs Scored, Overs Batted, Wickets Lost, Runs Conceded, Overs Bowled, Wickets Taken, Result.
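The project performed this summarization with yorkr in R. Purely as an illustration of the same step in Python, the sketch below reduces one Cricsheet YAML file to a match-level summary; the field names follow the Cricsheet YAML layout (info, innings, deliveries) and should be verified against the downloaded files.

```python
# Sketch: summarize one Cricsheet YAML match file (the project used yorkr).
import yaml

def summarize_match(path):
    with open(path) as f:
        match = yaml.safe_load(f)

    info = match["info"]
    summary = {
        "teams": info["teams"],
        "venue": info.get("venue"),
        "date": info["dates"][0],
        "winner": info.get("outcome", {}).get("winner"),
    }

    # Collapse ball-by-ball deliveries into per-innings totals.
    for innings in match["innings"]:
        (name, data), = innings.items()       # e.g. "1st innings"
        runs = wickets = balls = 0
        for delivery in data["deliveries"]:
            (_, ball), = delivery.items()     # key is the over.ball number
            runs += ball["runs"]["total"]
            balls += 1
            if "wicket" in ball:
                wickets += 1
        summary[name] = {
            "batting_team": data["team"],
            "runs": runs,
            "wickets": wickets,
            "overs": balls // 6 + (balls % 6) / 10.0,   # overs notation
        }
    return summary
```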
Feature Construction
After applying yorkr to the data, we retrieved the stored data from MongoDB using Apache Spark for feature construction. The 31 features listed in chapter 3 are then created and saved to a CSV file (a sketch of this step follows).
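A minimal sketch of this retrieval step, assuming the MongoDB Spark connector package is on the Spark classpath and the summaries live in a cricket.matches collection (both assumptions); the run_diff column is a single illustrative feature, not one of the actual 31.

```python
from pyspark.sql import SparkSession

# Read the summarized matches from MongoDB via the mongo-spark-connector.
spark = (SparkSession.builder
         .appName("feature-construction")
         .config("spark.mongodb.input.uri",
                 "mongodb://localhost/cricket.matches")
         .getOrCreate())

matches = spark.read.format("mongo").load()

# Feature construction happens here; one illustrative difference feature.
features = matches.withColumn(
    "run_diff", matches["runs_scored"] - matches["runs_conceded"])

# Persist the feature table (Spark writes a directory of CSV part files).
features.write.mode("overwrite").csv("features_csv", header=True)
```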
Training
For training and classification, we use 70% of our data. We train multiple models on this data so that we can improve our prediction; we used five different models to train our data set:
Naive Bayes, Logistic Regression, Random Forests, Decision Trees and SVM.
Testing
We use 30% of our data set to check the accuracy of our results. This 30% of the data was randomly tested with all of the above models. Using Naive Bayes, we got the maximum accuracy, approximately 68%.
Prediction
In order to make predictions, we just have to give the team name and the opponent's name; the system retrieves the historical features for the two teams, analyzes the record and predicts the outcome of the match. A sketch of the training, testing and prediction steps follows.
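Here is a sketch of the 70/30 split with the five models named above, using scikit-learn. The CSV file name and the result label column are assumptions, and the feature columns are assumed to be numeric.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

data = pd.read_csv("features.csv")    # the 31-feature table (assumed name)
X = data.drop(columns=["result"])     # assumed label column
y = data["result"]

# 70% of the data for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
}

# Train every model and report held-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```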
#14: Data Collection:
We applied multiple queries to fetch a large volume of data from Twitter.
The whole process was completed in five steps (a sketch follows the list):
Queries
Twitter API
Result
Data extraction
Saving in MongoDB
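As a sketch of these five steps in Python, assuming a tweepy v3-style API, with placeholder credentials and query terms:

```python
import tweepy
from pymongo import MongoClient

# Steps 1-2: authenticate and send our queries to the Twitter API.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

collection = MongoClient()["cricket"]["tweets"]

for query in ["#INDvPAK", "cricket world cup"]:   # placeholder queries
    # Steps 3-4: iterate the results and extract the attributes we keep.
    for status in tweepy.Cursor(api.search, q=query, lang="en").items(500):
        doc = {
            "id": status.id,
            "text": status.text,
            "user": status.user.screen_name,
            "followers": status.user.followers_count,
            "created_at": status.created_at,
            "source": status.source,
        }
        collection.insert_one(doc)                # Step 5: save in MongoDB
```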
Data Filtration:
We retrieved the stored data from MongoDB using Apache Spark and passed it through a series of filtration steps. We considered several factors that classify a tweet as spam or ham (a rule-based sketch follows the list). Those factors are:
Content requesting re-tweets and follows
Short content length
Large numbers of hashtags
Bot-Friendly content source
Users that create little content
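A rule-based sketch of these factors; every threshold and the list of bot-friendly sources are illustrative assumptions, not the values the project used.

```python
import re

def is_spam(tweet, user):
    text = tweet["text"]
    if re.search(r"\b(rt to win|retweet|follow back|follow me)\b", text, re.I):
        return True                       # requests re-tweets and follows
    if len(text) < 20:
        return True                       # short content length
    if text.count("#") > 4:
        return True                       # large number of hashtags
    if tweet.get("source") in {"SpamBot", "AutoPoster"}:
        return True                       # bot-friendly content source
    if user.get("statuses_count", 0) < 10:
        return True                       # user creates little content
    return False

sample = {"text": "RT to win tickets! #cwc #final #cricket #win #free",
          "source": "AutoPoster"}
print(is_spam(sample, {"statuses_count": 3}))   # True
```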
Feature Reduction
We performed the following operations on the tweets in the cleansing and normalization phase (a sketch follows the list):
Remove Retweets
Replace Usernames
Replace URLs
Remove Repeated letters
Remove Short Words
Remove Stop Words
Remove Non-English Words
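A sketch of these operations with regular expressions; the stop-word and English-word sets below are tiny stand-ins for the full lists/dictionary the project would use.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to", "of"}                  # sample only
ENGLISH_WORDS = {"india", "will", "win", "match", "team", "great"}  # stand-in

def clean_tweet(text):
    if text.startswith("RT "):                    # remove retweets entirely
        return ""
    text = re.sub(r"@\w+", "USER", text)          # replace usernames
    text = re.sub(r"https?://\S+", "URL", text)   # replace URLs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # squeeze repeated letters
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    words = [w for w in words if len(w) > 2]            # remove short words
    words = [w for w in words if w not in STOP_WORDS]   # remove stop words
    keep = ENGLISH_WORDS | {"user", "url"}
    return " ".join(w for w in words if w in keep)      # keep English words

print(clean_tweet("@fan India will wiiiiin the match!!! https://t.co/x"))
```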
Training
For training and classification, we use 70% of our data. We train multiple models on this data so that we can improve our prediction.
Testing
We use 30% of our data set to check the accuracy of our results; this 30% was randomly tested with all of the above models.
Prediction
In order to make predictions, we just have to give the team name and the opponent's name; the system retrieves the features from the historical data for the two teams, analyzes the record and predicts the outcome of the match.
#16: PREDICTION THROUGH STREAMING TWEETS: Twitter open-sourced its Hosebird client (hbc), a robust Java HTTP library for consuming Twitter's Streaming API. We used hbc to create a Kafka Twitter-stream producer, which tracked our query terms in Twitter statuses and produced a Kafka stream from them; that stream was later used to send the data from Kafka to Spark Streaming. A Python analogue of the producer is sketched below.
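The project used the Java hbc client. Purely as a Python analogue, here is a sketch of the same producer idea using tweepy's v3 streaming API and kafka-python, with placeholder credentials, topic name and query terms.

```python
import json
import tweepy
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda d: json.dumps(d).encode())

class TweetToKafka(tweepy.StreamListener):
    def on_status(self, status):
        # Push each matching tweet onto the topic consumed by Spark Streaming.
        producer.send("cricket-tweets", {"text": status.text,
                                         "user": status.user.screen_name})

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, TweetToKafka())
stream.filter(track=["cricket", "#INDvAUS"])      # our query terms
```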
SPARK STREAMING: Spark Streaming is a real-time processing tool that runs on top of the Spark engine.
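On the consuming side, a minimal sketch using Spark Streaming's pre-3.0 Kafka direct-stream API (it needs the spark-streaming-kafka package on the classpath); the topic and broker address match the producer sketch above and are assumptions.

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="tweet-stream")
ssc = StreamingContext(sc, batchDuration=10)      # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["cricket-tweets"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; parse the JSON tweet bodies.
tweets = stream.map(lambda kv: json.loads(kv[1]))
tweets.map(lambda t: t["text"]).pprint()          # downstream scoring goes here

ssc.start()
ssc.awaitTermination()
```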
Apache Zookeeper
Apache Zookeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
Apache Kafka
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.