Web Mining of Drug Reviews for Market Analysis
Ajinkya Ingle
Rohan Waghere
Priyanka Bhandari
Gaurav Kshirsagar

Agenda
1. Introduction
2. Gathering data
3. Cleaning data
4. Exploratory data analysis
5. Reviews analysis
6. NLP analysis
7. Classification Algorithms
8. LDA Topic Modelling
9. Conclusion

Introduction
▹ The US has the largest pharmaceutical market in the world, valued at $339 billion.
▹ US prescription drug spending is expected to reach as high as $610 billion by 2021.
▹ Pfizer alone spent $7.6 billion on R&D in financial year 2017.
▹ Growth is expected to accelerate in the coming years.
▹ While these drugs are prescribed for their therapeutic properties, their use may result in unintended or adverse effects.
▹ People need a way to judge the quality of drugs in this overcrowded market.

Gathering Data - Data Source
▹ We chose WebMD as our primary source to gather data.
▹ It is considered a legitimate source of information for all sorts of drugs

Gathering Data
▹ Searched for the top 5 drugs for a given condition.
▹ Parsed the data from each drug's page.
▹ Gathered reviews and ratings from each drug's review page.
▹ Stored the results in a CSV file.

Gathering Data
▹ We parsed all the pages to acquire all the reviews.
▹ Age, gender, ratings and comments were collected in a dataframe.
▹ BeautifulSoup was used to parse the data from the web pages.
[Figure labels: Ratings; Comments/Reviews; Age and Gender]
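
A minimal sketch of the parsing step, assuming a simplified review markup. The HTML snippet, the class names (review, reviewer-details, rating, comment) and the column names are hypothetical; the actual WebMD page structure is not shown in these slides.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of a drug review listing.
SAMPLE_HTML = """
<div class="review">
  <p class="reviewer-details">Reviewer: 45-54 Female</p>
  <span class="rating">4</span>
  <p class="comment">Helped with my back pain.</p>
</div>
<div class="review">
  <p class="reviewer-details">Reviewer: 25-34 Male</p>
  <span class="rating">1</span>
  <p class="comment">Made me dizzy.</p>
</div>
"""

FIELDS = ["reviewer_details", "rating", "comment"]


def parse_reviews(html):
    """Return one dict per review block found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="review"):
        details = block.find("p", class_="reviewer-details")
        rating = block.find("span", class_="rating")
        comment = block.find("p", class_="comment")
        rows.append({
            "reviewer_details": details.get_text(strip=True),
            "rating": int(rating.get_text(strip=True)),
            "comment": comment.get_text(strip=True),
        })
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


reviews = parse_reviews(SAMPLE_HTML)
csv_text = to_csv(reviews)
```

In a real run the HTML would come from the downloaded review pages, and the CSV text would be written to a file rather than kept in memory.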

Gathering Data - Final output
▹ The data had a lot of noise and repetitive terms.
▹ The age and gender had to be extracted from the “Reviewer Details” column

Cleaning Data
▹ Got rid of the repetitive terms
▹ Created separate columns for Age and Gender
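
One way to split age and gender out of the "Reviewer Details" text is a regular expression. This is a sketch under an assumed format: the sample strings and the age-range pattern below are illustrative, not taken from the actual dataset.

```python
import re

# Assumed shape: an age range such as "45-54" or "75 or over",
# followed by "Male" or "Female" somewhere in the details string.
DETAILS_PATTERN = re.compile(
    r"(?P<age>\d{1,2}-\d{1,2}|\d{2} or over)\s+(?P<gender>Male|Female)"
)


def split_reviewer_details(details):
    """Return (age_group, gender), or (None, None) if not found."""
    match = DETAILS_PATTERN.search(details)
    if match is None:
        return None, None
    return match.group("age"), match.group("gender")


example = split_reviewer_details(
    "Reviewer: jdoe, 45-54 Female on Treatment for 1 to 6 months (Patient)"
)
```

Rows where the pattern does not match come back as (None, None), which makes them easy to drop in the next cleaning step.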

Cleaning Data
▹ Dropped rows with NA or null values.
▹ Removed punctuation.
▹ Categorized age groups.
▹ Assigned a gender to each user.
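
The cleaning steps above can be sketched with pandas. The toy rows, the column names and the list of age ranges are assumptions for illustration; treating the age ranges as an ordered category is one possible reading of "categorized age groups".

```python
import string

import pandas as pd

# Toy rows standing in for the scraped data (not the real dataset).
raw = pd.DataFrame({
    "age": ["45-54", "19-24", None, "65-74"],
    "gender": ["Female", "Male", "Female", "Male"],
    "comment": ["Works great!!!", "Made me dizzy, sadly.", "No relief...", None],
})

# Drop rows where any field is missing (NA or null).
clean = raw.dropna().reset_index(drop=True)

# Strip punctuation from the comment text.
punctuation_table = str.maketrans("", "", string.punctuation)
clean["comment"] = clean["comment"].str.translate(punctuation_table)

# Store the age ranges as an ordered category so they sort and plot in order.
AGE_ORDER = ["19-24", "25-34", "35-44", "45-54", "55-64", "65-74"]
clean["age"] = pd.Categorical(clean["age"], categories=AGE_ORDER, ordered=True)
```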

Exploratory Data Analysis
Stats of the sample dataset

EDA: Gender
▹ The gender distribution is not balanced
▹ Female: 67%, Male: 33%
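
Percentage breakdowns like this one, and the age and rating breakdowns in the EDA slides, are simple share computations. A small sketch on toy values (the list below is made up, not the real data):

```python
from collections import Counter


def percentage_share(values):
    """Return each distinct value's share of the total, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy input: two female reviewers and one male reviewer.
shares = percentage_share(["Female", "Female", "Male"])
```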

EDA: Age
▹ Use increased with age
▹ 54.7% of analgesic users were in the 45 to 64 age group
▹ 13.1% of analgesic users were aged 65 or above

EDA: Effectiveness Rating
▹ 26.6% of analgesic reviews gave the highest Effectiveness Rating
▹ A comparable percentage of reviews gave the second-highest rating

EDA: Satisfaction Ratings
▹ 32.2% of reviews had the lowest Satisfaction Rating
▹ 27.1% of reviews had the highest Satisfaction Rating

EDA: Comparing two or more features
▹ Satisfaction Rating vs Ease of Use Rating
▹ Ease of use of an analgesic does not guarantee customer satisfaction

EDA: Comparing two or more features
▹ Male customers in the 19-24 age category give higher effectiveness ratings than female customers
▹ The opposite is true in the 25-34 age category
▹ The drug appears to affect males and females differently across age groups
[Figure panels: Effectiveness Rating; Satisfaction Rating]

Reviews Analysis
▹ Most of the long reviews come from people who gave a rating of 1, 4 or 5.
▹ Most reviews contain no more than 50 words.
▹ The correlation between review length and rating was 0.068, i.e. essentially no linear relationship.
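
A sketch of how a length-to-rating correlation can be computed, assuming the Pearson coefficient and word count as the length measure (the slides do not say which were used). The four reviews are toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy (comment, rating) pairs; not the real reviews.
toy_reviews = [
    ("Works well", 5),
    ("Did nothing for my pain at all", 1),
    ("Okay I guess", 3),
    ("Terrible side effects and no relief whatsoever", 1),
]
lengths = [len(text.split()) for text, _ in toy_reviews]
ratings = [rating for _, rating in toy_reviews]
r = pearson(lengths, ratings)
```

A value near 0, like the 0.068 reported above, means review length says almost nothing about the rating in a linear sense.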

NLP - TFIDF vectorization
▹ We used TF-IDF vectorization to identify the important words in the documents.
▹ Extracted the top 5 words in each document by TF-IDF weight.
▹ Gives a basic idea of what people talk about most frequently.
▹ Reveals side effects experienced by people.

NLP - WordClouds
● WordCloud for female patients
● Migraine, headache and fever are some of the most cited problems that can be observed
● WordCloud for male patients
● Sluggish, headache and fever are some of the most cited problems that can be observed

Classification - Using Tfidf vectors and sklearn
▹ Classifying Satisfaction Rating based on comments
▹ Created 3 buckets of reviews
▸ Comments
▸ Cleaned Comments
▸ Adjective Comments
▹ Gives a broader picture of how model accuracy varies across the buckets

Classification - Confusion Matrices
▹ The model struggles to classify 1-star ratings.
▹ The maximum accuracy obtained was 52%, with a linear SVM.
▹ We therefore try a different approach to classification.
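
A sketch of the text-based classification setup, assuming a scikit-learn pipeline of TfidfVectorizer and LinearSVC. The comments and ratings are toy data restricted to three rating levels, so the resulting matrix says nothing about the 52% figure above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data: comment text and its satisfaction rating.
train_comments = [
    "no relief at all, awful side effects",
    "made me sick and dizzy, awful",
    "it was okay, some relief",
    "okay but wears off quickly",
    "great relief, works every time",
    "works great, no side effects",
]
train_ratings = [1, 1, 3, 3, 5, 5]

test_comments = ["awful, made me dizzy", "works great", "okay relief"]
test_ratings = [1, 5, 3]

# TF-IDF features feeding a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_comments, train_ratings)
predicted = model.predict(test_comments)

# Rows are true ratings, columns are predicted ratings.
matrix = confusion_matrix(test_ratings, predicted, labels=[1, 3, 5])
```

Off-diagonal counts in a given row show which ratings the model confuses with that true rating.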

Classification based on extracted features
▹ We extracted additional features such as the number of words, number of characters, etc.
▹ Applied classification algorithms using these features.
▹ Compared the results with the previous method.
[Figure label: New Features]
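
A sketch of this kind of feature extraction. Only the word and character counts are named in the slides; the other features and all the feature names here are assumptions added for illustration.

```python
import string


def extract_features(comment):
    """Compute simple surface features of one review comment."""
    words = comment.split()
    return {
        "num_chars": len(comment),
        "num_words": len(words),
        # Assumed extras, not confirmed by the slides:
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_punctuation": sum(ch in string.punctuation for ch in comment),
        "num_uppercase_words": sum(w.isupper() for w in words),
    }


features = extract_features("This drug WORKS, no side effects!")
```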

Correlation of features
▹ This correlation plot contrasts satisfaction rating, our chosen target variable, with the other variables to show how they relate to it.
▹ Effectiveness rating and satisfaction rating have a correlation of 0.86, which is high.

Age-wise ratings distribution
▹ The satisfaction rating distribution for every age interval shows that the 45-54 and 55-64 age groups were highly satisfied with their prescribed medicines.

Comparison of Different Classification Algorithms
▹ Classifying Satisfaction Rating based on extracted features
▹ Accuracy of each algorithm when predicting our decision variable from the other variables:
▸ KNN = 40%
▸ Naive Bayes = 59%
▸ Logistic Regression = 60.5%

Comparison of Different Classification Algorithms
▹ SVM = 43.5%
▹ Random Forest = 61%
▹ Neural Networks = 59%
▹ Thus Logistic Regression (60.5%) and Random Forest (61%) perform best among the algorithms in our scenario.
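
A sketch of how such a comparison can be run with scikit-learn, using cross-validated accuracy. The data is synthetic (make_classification), so the scores it produces are unrelated to the percentages reported above, and only four of the six algorithms are included to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the extracted features and the rating classes.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```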

Parameters Tuning
▹ We used grid search hyperparameter tuning (GridSearch) to improve the accuracy of the MLP classifier.
▹ However, no significant performance improvement was observed.
▹ An average precision of 50% was obtained.
▹ Neural networks tend to improve with more data.
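
A sketch of grid search over an MLP classifier, assuming scikit-learn's GridSearchCV. The parameter grid and the synthetic data are illustrative choices; the grid actually searched is not given in the slides.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the extracted features and the rating classes.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical grid: two layer sizes and two regularization strengths.
param_grid = {
    "hidden_layer_sizes": [(8,), (16,)],
    "alpha": [1e-4, 1e-2],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)

# Small networks may stop before converging; silence that warning here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```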

LDA - Topic Modeling (Without TFIDF)
▹ Top 7 topics and their most frequent words.
▹ ‘Headache’ and ‘addictive’ are some of the perceived side-effects

LDA - Topic Modeling (With TFIDF)
▹ The 1st topic contains positive words, suggesting the drug works for a fair number of customers
▹ However, the 2nd topic shows side effects such as “Dizziness” and “Itching”
▹ Also, people who need more potent drugs use “Vicodin” instead of “Tramadol”
▹ Fibromyalgia is widespread muscle pain in the back and shoulders. It is one of the reasons patients are prescribed Tramadol to reduce pain

Sentiment Analysis
▹ More than half of the comments were found to be subjective.
▹ Their reliability is therefore significantly reduced.

Analysis of Drug Side-effects - Comparative Study
▹ We analyzed the side-effects of the drugs and how they vary by age group and drug.
▹ Doctors can also use this data: when prescribing these medicines, they can also prescribe other medicines to reduce the side-effects.
[Figure panels: OxyContin; OxyCodone; Methadone; Morphine; Tramadol Oral; All Drugs]
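
One simple way to tabulate side-effects by drug and age group is keyword counting. This is a sketch only: the keyword list, the field names and the toy reviews are hypothetical, and the slides do not state how side-effects were actually identified.

```python
from collections import Counter, defaultdict

# Hypothetical list of side-effect keywords to look for.
SIDE_EFFECT_TERMS = ["dizziness", "itching", "nausea", "headache"]

# Toy reviews; not the real data.
toy_reviews = [
    {"drug": "Tramadol", "age": "45-54", "comment": "Bad dizziness and itching"},
    {"drug": "Tramadol", "age": "45-54", "comment": "Dizziness for hours"},
    {"drug": "Morphine", "age": "25-34", "comment": "Nausea and a headache"},
    {"drug": "Morphine", "age": "25-34", "comment": "Worked fine"},
]


def side_effects_by_group(rows):
    """Count reviews mentioning each term, keyed by (drug, age group)."""
    table = defaultdict(Counter)
    for row in rows:
        text = row["comment"].lower()
        for term in SIDE_EFFECT_TERMS:
            if term in text:
                table[(row["drug"], row["age"])][term] += 1
    return table


side_effect_counts = side_effects_by_group(toy_reviews)
```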

Conclusion
▹ WebMD reviews are not the ideal place for drug companies to look for insights, since reviewers tend to be strongly biased, either positively or negatively.
▹ However, it is a good source for learning customer sentiment and drug side-effects, and for studying competitors' drugs.
“Headaches: When Is It an Emergency?” The first
page contains no hard facts — you have to click and
thereby drive up the site’s lucrative click-throughs
— but instead quickly transforms visitors from Web
users with headaches to hard-core migraineurs and
drug.
- Virginia Heffernan (The New York Times)
