Web Mining of Drug
Reviews for Market
Analysis
Ajinkya Ingle
Rohan Waghere
Priyanka Bhandari
Gaurav Kshirsagar
Agenda
1. Introduction
2. Gathering data
3. Cleaning data
4. Exploratory data analysis
5. Reviews analysis
6. NLP analysis
7. Classification Algorithms
8. LDA Topic Modelling
9. Conclusion
Introduction
 The US has the largest pharmaceutical market in
the world, valued at $339 billion.
 US prescription drug spending is expected to
reach as high as $610 billion by 2021.
 Pfizer alone spent $7.6 billion on R&D in
financial year 2017.
 The growth is expected to accelerate in coming
years.
 While these drugs are prescribed for their
therapeutic properties, their use may result in
unintended or adverse effects.
 There is a need for people to know the quality of
drugs in this overcrowded market.
Gathering Data - Data Source
 We chose WebMD as our primary source to gather data.
 It is considered a legitimate source of information for all sorts of drugs
Gathering Data
 Searched for the top 5 drugs for a
given condition.
 Parsed the data from each
drug's page.
 Gathered reviews and ratings
from the drug review page.
 Stored the obtained results in a
CSV file.
Gathering Data
 We parsed all the pages to acquire all the reviews.
 Age, Gender, Ratings and Comments were collected in a dataframe.
 BeautifulSoup was used to parse the data from the web pages.
Ratings
Comments/Reviews
Age and Gender
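As a rough illustration of the gathering step, the sketch below pulls review comments out of a page and writes them out in CSV form. It is not the authors' scraper: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages, and the `<p class="comment">` markup, `ReviewParser`, and `reviews_to_csv` are hypothetical, since the real WebMD page structure is not shown in the slides.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects the text of <p class="comment"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_comment = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "comment") in attrs:
            self._in_comment = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_comment:
            self.comments.append("".join(self._buffer).strip())
            self._in_comment = False

    def handle_data(self, data):
        if self._in_comment:
            self._buffer.append(data)


def reviews_to_csv(html):
    """Parse review comments from an HTML string and return them as CSV text."""
    parser = ReviewParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["comment"])  # header row
    for comment in parser.comments:
        writer.writerow([comment])
    return out.getvalue()
```

In a real run the HTML would come from each drug's review pages, and the CSV text would be written to a file rather than returned as a string.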
Gathering Data - Final output
 The data had a lot of noise and repetitive terms.
 The age and gender had to be extracted from the Reviewer Details column
Cleaning Data
 Got rid of the repetitive terms
 Created separate columns for Age and Gender
Cleaning Data
 Dropped rows with NA or null
values.
 Removed punctuation
 Categorized ages into groups
 Assigned genders to the respective
users.
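The cleaning steps can be sketched roughly as follows. The exact layout of the Reviewer Details column is not shown in the slides, so the example string format (an age range followed by a gender), the regular expressions, and the function names are all assumptions made for illustration.

```python
import re
import string

# Assumed patterns: an age range such as "45-54" (or "75 or over") and a gender word.
AGE_RE = re.compile(r"\b(\d{1,2}-\d{1,2}|\d{2} or over)\b")
GENDER_RE = re.compile(r"\b(Male|Female)\b", re.IGNORECASE)


def parse_reviewer_details(details):
    """Split a Reviewer Details string into separate age and gender values."""
    age = AGE_RE.search(details)
    gender = GENDER_RE.search(details)
    return {
        "age": age.group(1) if age else None,
        "gender": gender.group(1).capitalize() if gender else None,
    }


def strip_punctuation(text):
    """Remove ASCII punctuation characters from a comment."""
    return text.translate(str.maketrans("", "", string.punctuation))


def clean_rows(rows):
    """Drop rows with missing values and return cleaned age/gender/comment records."""
    cleaned = []
    for row in rows:
        parsed = parse_reviewer_details(row.get("reviewer_details") or "")
        comment = strip_punctuation(row.get("comment") or "").strip()
        if not comment or parsed["age"] is None or parsed["gender"] is None:
            continue  # skip rows with NA or null values
        cleaned.append({**parsed, "comment": comment})
    return cleaned
```

Rows that lack a comment, an age range, or a gender are discarded, mirroring the NA/null removal described above.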
Exploratory Data Analysis
Stats of the sample dataset
EDA: Gender
 Not a balanced gender
distribution
 Female: 67%
Male: 33%
EDA: Age
 Use increased with
age
 54.7% of analgesic
users were in the 45 to 64
age group
 13.1% of analgesic users
were aged 65 or
older
EDA: Effectiveness Rating
 26.6% of analgesic reviews
gave the highest
Effectiveness Rating
 A comparable
percentage of reviews
gave the second-highest
rating
EDA: Satisfaction Ratings
 32.2% of reviews had the
lowest Satisfaction
Ratings
 27.1% of reviews had the
Highest Satisfaction
Rating
EDA: Comparing two or more features
 Satisfaction Rating vs Ease
of Use Rating
 Ease of use of the
analgesics doesn't assure
the satisfaction of the
customer
Effectiveness Rating
EDA: Comparing two or more features
 Male customers in the 19-24 age category gave higher effectiveness ratings than female customers
 Whereas the opposite is true in the 25-34 age category
 The drugs appear to affect males and females differently across age groups
Satisfaction Rating
Reviews Analysis
 Most of the long reviews come from people who rated 1, 4 or 5.
 Most reviews contain no more than 50 words.
 The correlation between review length and rating was only 0.068, i.e. essentially no linear relationship.
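The slides do not say how the length-to-rating correlation was computed. A minimal sketch, assuming a Pearson correlation between each review's word count and its rating, could look like this (the function names are illustrative):

```python
from math import sqrt


def review_length(comment):
    """Number of whitespace-separated words in a review."""
    return len(comment.split())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson([review_length(c) for c in comments], ratings)` on the cleaned data would give a value between -1 and 1; a value near zero, such as the 0.068 reported here, indicates that review length says little about the rating.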
NLP - TFIDF vectorization
 We used TF-IDF vectorization to identify the important words in the documents.
 Extracted the top 5 most important words in each document by TF-IDF weight
 Gives a basic idea of what people talk about most frequently
Reveals side effects
experienced by
people
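To make the "top words by TF-IDF weight" step concrete, here is a small hand-rolled sketch. It is an assumption-laden stand-in rather than the authors' code: it uses the plain formula tf × log(N / df) with whitespace tokenization, which differs in detail from the smoothed weighting that scikit-learn's vectorizer applies, and `top_tfidf_terms` is a hypothetical name.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, k=5):
    """Return the k highest-weighted terms of each document under plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending weight, breaking ties alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append([term for term, _ in ranked[:k]])
    return results
```

Words that appear in every review (for example the drug name) receive a weight of zero, so the terms that surface are the ones distinctive to each review, which is why side effects tend to show up.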
NLP - WordClouds
 WordCloud for female patients
 Migraine, headache and fever are
some of the most cited problems
that can be observed
 WordCloud for male patients
 Sluggish, headache and fever are
some of the most cited problems
that can be observed
Classification - Using Tfidf vectors and sklearn
 Classifying Satisfaction Rating based on
comments
 Created 3 buckets of reviews
 Comments
 Cleaned Comments
 Adjective Comments
 Gives a broader picture of variation in the
accuracy of the models
Classification - Confusion Matrices
 The model struggles to classify 1-star ratings.
 The maximum accuracy we obtained was 52%, with a linear SVM.
 We therefore try a different approach to classification.
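A confusion matrix makes it possible to see which ratings a model gets wrong, not just how often it is right. The sketch below is a plain-Python illustration of that idea, not the scikit-learn routine used for the slides; the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in the order of `labels`."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix, labels):
    """For each true label, the fraction of its examples predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall
```

A low recall for one rating alongside a reasonable overall accuracy is the pattern described above, where one class is hard to classify.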
Classification based on extracted features
 We extracted additional
features such as number of
words, characters, etc.
 Applied classification
algorithms using these
features.
 Compared the results with
previous method.
New
Features
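A sketch of this kind of feature extraction is shown below. Only the word and character counts are named in the slides; the average word length and exclamation-mark count are hypothetical extras added for illustration, as is the function name.

```python
def extract_features(comment):
    """Derive simple numeric features from a review comment."""
    words = comment.split()
    word_count = len(words)
    return {
        "word_count": word_count,
        "char_count": len(comment),
        # Hypothetical extra features, not taken from the slides:
        "avg_word_length": (sum(len(w) for w in words) / word_count) if word_count else 0.0,
        "exclamation_count": comment.count("!"),
    }
```

Each review then becomes a short numeric vector that can be passed to the classifiers alongside the existing rating columns.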
Correlation of features
 This correlation plot contrasts
satisfaction rating,
our target variable,
with the other variables
to see how they relate to it.
 Effectiveness rating and
satisfaction rating have a
correlation of 0.86, which is
high.
Age-wise ratings distribution
 The satisfaction rating distribution is shown
for every age interval; the 45-54 and 55-64 age groups
reported high satisfaction with their
prescribed medicines.
Comparison of Different Classification Algorithms
 Classifying Satisfaction Rating
based on extracted features
 Accuracy of each algorithm
in predicting our decision
variable from the other
variables:
 KNN=40%
 Naive Bayes=59%
 Logistic
Regression=60.5%
Comparison of Different Classification Algorithms
 SVM=43.5%
 Random Forest=61%
 Neural Networks=59%
 Thus Logistic Regression (60.5%)
and Random Forest (61%) outperform
the other algorithms in our
scenario.
Parameters Tuning
 We used GridSearch parameter tuning to improve the accuracy of the MLP classifier
 But no significant performance improvement was observed.
 An average precision of 50% was obtained.
 Neural networks tend to improve with more data.
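Grid search simply tries every combination of candidate parameter values and keeps the best-scoring one. The sketch below shows that loop in plain Python; it is a simplified stand-in for scikit-learn's grid search (no cross-validation), and `grid_search`, `score_fn`, and the parameter names in the usage note are hypothetical.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best one.

    param_grid maps parameter names to lists of candidate values.
    score_fn takes a dict of parameters and returns a score (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `score_fn` would train the classifier with the given settings (for instance a hidden-layer size and a regularization strength) and return its validation accuracy. If every combination scores about the same, as reported above, tuning alone is unlikely to help.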
LDA - Topic Modeling (Without TFIDF)
 Top 7 topics and their most frequent words.
 Headache and addictive are some of the perceived side-effects
LDA - Topic Modeling (With TFIDF)
 The 1st topic contains positive words, suggesting the drug works for a fair number of
customers
 However, the 2nd topic shows side effects such as dizziness and itching
 Also, people who need more potent drugs use Vicodin instead of Tramadol
 Fibromyalgia is widespread muscle pain in the back and shoulders. That is one of the
reasons patients are prescribed Tramadol to reduce pain
Sentiment Analysis
 More than half of the comments are observed to be subjective.
 Therefore the reliability of the reviews is significantly reduced.
Analysis of Drug Side-effects - Comparative Study
OxyContin, Oxycodone
 We have tried to analyze the side-effects of drugs and how they vary based on the age groups and drug.
Analysis of Drug Side-effects - Comparative Study
Methadone Morphine
Tramadol Oral All Drugs
Analysis of Drug Side-effects - Comparative Study
 Doctors can also use this data: when prescribing these medicines, they can also prescribe
other medicines to reduce the side effects.
Conclusion
 WebMD reviews are not the ideal place for
drug companies to look for insights, since
reviewers are inherently biased toward either a
positive or a negative view.
 However, they are a good source for learning
customers' sentiments and the drugs' side effects,
and for studying competitors' drugs.
"Headaches: When Is It an Emergency?" The first
page contains no hard facts - you have to click and
thereby drive up the site's lucrative click-throughs
- but instead quickly transforms visitors from Web
users with headaches to hard-core migraineurs and
drug.
- Virginia Heffernan (The New York Times)
Thank You !!



Editor's Notes

  • #11: Knowledge derived from analysis (exploratory and confirmatory in nature): detection of mistakes; checking of assumptions; determining relationships between explanatory variables; assessing the direction and rough size of relationships between explanatory and outcome variables; preliminary selection of appropriate models of the relationship between the outcome and the explanatory variables.
  • #21: Source : https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html