�ݺ�ߣ

CADX150
Healthcare Analytics
Module III
Dr.Jose Reena K

• What is a database?
• A database is essentially a collection of data that has been arranged into a
structure that allows it to be easily read, edited, added or deleted.
• What is a database management system?
• Database Management System is a system designed to manage the
automatic and orderly database. The Database Management System is an
automatic system helping the user to control information, create, update
and maintain the database.
• What do databases do in healthcare?
• Healthcare databases help individual medical organizations understand their
daily activities and their place within the larger healthcare industry. This
allows healthcare professionals to make decisions about how they run their
businesses, the work they do, and the systems they use to manage their
operations. With the rapid advancement of healthcare technology and post-
COVID-19 changes in healthcare operations, it is more vital than ever for
healthcare databases to be organized, well-maintained, and simple to use.

• Why is database management important?
• Healthcare databases have been an essential component of understanding and
improving critical care worldwide. The importance of database technology in healthcare
cannot be overstated. The Healthcare database system generates data that requires
delicate handling and is developed for the purpose of assessing the quality of healthcare,
often for a specific disease or within a specific healthcare delivery system. databases in
healthcare can promote:
• 1. Assessing the quality of healthcare
• Hospitals, health departments, local, state and federal agencies, to name a few, all
contribute information to healthcare databases. Healthcare specialists can gain a
thorough understanding of the quality of current healthcare operations by analyzing
aspects such as; financing available to healthcare facilities, service availability and
affordability, use of innovation, and barriers to seeking care.
• 2. Tracking and Monitoring
• Medical workers can utilize reporting and logging tools to keep track of operations with
the help of healthcare databases. these aids healthcare providers in monitoring and
improving the quality of patient treatment by providing an important tool for tracking
healthcare use. Healthcare providers, for instance, need to know whether people are
getting their vaccination or not.

Healthcare DBMS
• A database can be defined as a collection of related data (Elmasri and
Navathe, 2010).
• Databases are often subcategorized as SQL databases or NoSQL databases.
• In SQL databases, data is recorded in tables and consists of rows and
columns. The related data may be distributed across several tables in a
trade-off between efficient storage and convenience.
• The database management system (DBMS) is a piece of software that
enables the database to serve several functions. For one thing, it allows for
the retrieval of data using the SQL language (for SQL databases). Another
function is to update the data when needed, also using SQL. Additional
functions of a DBMS include protecting and securing the data.

Data engineering with SQL – an example case
• For this chapter, let's pretend you secured a predictive analytics assignment
with a cardiology practice located in the United States. The practice wants
you to predict which patients are at risk of dying within 6 months of their
visit to the clinic. They make their data available to you in the form of a
database that includes six tables.
• For simplicity, we truncate the database to include the information for five
patients only. Our task is to manipulate the data using the SQL language to
consolidate it into a single table so that it can be used for machine learning.
We will first go over the patients in the database and the database structure.
Then, we will introduce basic SQL concepts for engineering and manipulate
the data into a form amenable to machine learning.

Case details – predicting mortality for a cardiology practice
• The cardiology practice you are working with has two physicians on
staff: Dr. Johnson and Dr. Wu. While the practice has many patients,
they are interested in identifying which patients who visit are at high
risk of all-cause mortality within the next 6 months.
• Now that we've reviewed the details of the modeling assignment,
let's take a look at the five patients in the database. The preliminary
data sent to you by the cardiology practice includes information on
five patients, distributed across six tables.

Case details – predicting mortality for a cardiology practice
The following is the information about the patients:
• Patient ID-1: Patient #1 in the database is a 65-year-old male who has congestive heart failure (CHF), a chronic
condition in which the heart is unable to pump blood properly to the rest of the body. He also has hypertension
(high blood pressure), which is a risk factor for CHF. He visited his cardiologist, Dr. Johnson, on 9/1/2016 and
17/1/2016. On his January 9th visit, he was found to have an elevated BP (154/94) and an elevated B-natriuretic
peptide (BNP) lab value of 350. BNP is a marker of CHF severity. He was subsequently placed on lisinopril and
furosemide, which are first-line treatments for CHF and hypertension. Unfortunately, he passed away on May
15th, 2016.
• Patient ID-2: Patient #2 is a 39-year-old female with a history of angina pectoris (cardiovascular-related chest
pain upon exercising) and diabetes mellitus. Diabetes mellitus is a risk factor for myocardial infarction (heart
attack; a late, often fatal manifestation of atherosclerotic heart disease), and angina pectoris can be seen as an
early manifestation of atherosclerotic heart disease. She visited her cardiologist, Dr. Wu, on January 15th, 2016, at
which time she was found to have an elevated blood glucose level of 225, a sign of uncontrolled diabetes. She
was started on metformin for her diabetes, as well as nitroglycerin, aspirin, and metoprolol for her angina.
• Patient ID-3: Patient #3 is a 32-year-old female who sees Dr. Johnson for management of her hypertension.
During her visit on February 1st, 2016 her blood pressure was elevated at 161/100. She was started on
valsartan/hydrochlorothiazide, an anti-hypertensive combination.

• Patient ID: 4: Patient #4 is a 51-year-old male who has severe CHF with
pulmonary hypertension. He saw Dr. Wu on February 27th, 2016. During
that visit, his weight was 211 lbs and his blood pressure was slightly
elevated at 143/84. His BNP level was highly elevated at 1,000. He was
given lisinopril and furosemide for his CHF as well as diltiazem for his
pulmonary hypertension. Unfortunately, he passed away on June 8th, 2016.
• Patient ID-5: The last patient in our database, patient #5, is a 58-year-old
male who presented to Dr. Wu on March 1st, 2016 with a history of CHF
and diabetes mellitus Type 2. During the visit, his glucose was elevated at
318 and BNP was moderately elevated at 400. He was started on lisinopril
and furosemide for his CHF and metformin for his diabetes.

The clinical database
• Now that we've gotten to know the five patients whose information is contained in our database,
we can describe the table structure and fields contained in the database, for six mock tables:
PATIENT, VISIT, MEDICATIONS, LABS, VITALS, and MORT.
• Although every clinical database is different, I've tried to use a structure that is commonly seen in
healthcare. Typically, tables are presented by clinical domains (for an example of a research study
that received tables in such a distributed format, see Basole et al., 2015). For example, there is
often one table that contains demographic and personal information, one table for lab results,
one for medications, and so on, so that is how we constructed the database in this example. They
tend to be tied together by a common identifier, which in our case is the Pid field.
• As we describe the tables, we must keep our end-goal of the data engineering phase in mind–to
combine the relevant information from the six tables into a single table, whose columns include
the target variable (mortality in this case) in addition to predictor variables, which should be
useful for predicting the target variable. This will enable us to make a machine learning model
with popular packages such as Python's scikit-learn. With this in mind, we will highlight selected
fields that will be useful for our assignment.

The PATIENT table
• In our example, the PATIENT table, which we can see in the following
screenshot, contains the demographic and identifying information of
our patients–their names, contact information, birthdays, and
biological sex.
• In this example, there are only five observations and 11 columns; in
real practice, this table would contain all of the patients affiliated with
the healthcare organization. The number of rows in this table might
range from hundreds to hundreds of thousands, while the table could
potentially include dozens of columns containing detailed
demographic information:

• In the database, every unique patient is assigned to an identifier (the
field labeled as Pid), which in our case is simply numbered 1 - 5. The
Pid column allows us to keep track of the patients across different
tables. Also, notice that there is one and only one entry for each
distinct patient ID.

The PATIENT table
• After identifying the indispensable identifer column, the focus should be on which variables to
keep and which to discard. Certainly, age and sex are important demographic predictors of
mortality. If race were in this table, that would be another important demographic variable.
• In the database, every unique patient is assigned to an identifier (the field labeled as Pid), which in
our case is simply numbered 1 - 5. The Pid column allows us to keep track of the patients across
different tables. Also, notice that there is one and only one entry for each distinct patient ID.
• Another notable variable in this table is the zip code. Increasingly, socioeconomic data is being
used in machine learning analyses. The zip code can potentially be tied to publicly available
census data; that data can then be joined to the data in this table on the zip code and could
potentially provide information on the average education level, income, and healthcare coverage
for each patient's zip code. There are even organizations who sell household-level information;
however, with that data comes a great responsibility for privacy protection and data security. For
this example, we will omit the zip code to keep our final table simple.
• Information we'll leave out from our final table includes names, street addresses, and phone
numbers. As long as we have the patient ID, these fields shouldn't have much of a predictive
impact on our target variable.

The VISIT table
• While the PATIENT table contains basic administrative information
about each patient, our assignment is to predict the mortality risk on
the basis of each visit. The VISIT table contains one observation for
each patient visit, along with some clinical information about each visit:
• Notice that the patient ID is no longer the primary identifier of this
table, since Patient #1 had two visits; instead, there is a Visit_id field
that is numbered from 10001 to 10006 in this example, with one
distinct ID per visit.

The VISIT table
• This table also contains Visit_date. Since the cardiology practice
indicated they want to know the mortality risk within 6 months of the
patient visit, we will have to use this field later when we compute the
target variable.
• Two of the fields in this table contain ICD (diagnosis) codes. Actual
tables may contain dozens of codes for each visit. For each coded
field, there is a corresponding name field that contains the name of
the condition that the code represents.

The MEDICATIONS table
• The MEDICATIONS table contains one entry for every medication
being taken by our five patients. In this example, there is no single
column that serves as a primary key for this table. As we can see in
the following screenshot, this table includes information about the
medication name, dose, frequency, route, prescribing physician, and
prescription date. The NDC code of each medication is also included;
we covered NDC codes in Chapter 2, Healthcare Foundations:

The MEDICATIONS table
• Including medications in our final table will not be straightforward. For example, the
information in the tables does not indicate the class of each medication. The NDC
code is present, but the NDC code is even more granular than the medication name
since it includes the route of administration and dosage in making each unique code;
therefore, multiple forms of lisinopril could have different NDC codes. In order to
make a column for each medication, we could potentially separately make a table for
each medication, which contains all of the medications that compose it, and then
merge that information into our table.
• If we choose to include dosage information, that field will require some cleaning.
Notice that Patient #3 is receiving an anti-hypertensive combination drug–the
valsartan component has a dosage of 160 mg, while the hydrochlorothiazide
component has a dosage of 12.5 mg. This could possibly be coded as two separate
drugs, but creating a script that splits combination drugs into two rows is not trivial.

The LABS table
• Laboratory information is an important part of clinical diagnostics, and many laboratory test results make for good predictor
variables (Donze et al., 2013; Sahni et al., 2018). The LABS table includes fields that describe the laboratory test name,
abbreviation, LOINC code, and result:
• There are some different approaches to including lab information in the final table. One way would be to include the raw lab result
as a continuous variable. However, this leads to a problem because the result would be NULL for most labs. We could potentially
navigate around this issue by imputing a value in the normal range when it is missing. Another approach would be to have a binary
variable for a lab test result that is in the abnormal range. This solves the missing data problem, since if the result is missing it
would be zero. However, a BNP value of 1,000 (which indicates severe CHF) would be no different than a BNP value of 350
(which indicates mild CHF) with this method. We will demonstrate both approaches in this chapter.

The LABS table
• Also note that the Lab_value field sometimes contains special
characters, for example in the troponin result. These will need to be
removed and the lab values interpreted accordingly. Culture results
(not included in this example) are completely textual, often naming
specific bacterial strains instead of numbers.
• Again, we repeat that this is a simplified example and that many of
the common labs that would be drawn for these patients (for
example, WBC count, hemoglobin, sodium, potassium, and so on) are
excluded here.

The VITALS table
• Vital signs are important indicators of a patient's health status and can be good predictors in
healthcare machine learning models (Sahni et al., 2018). Vital signs are typically taken at every
patient visit, so they can easily be included in their raw (numerical) form to preserve granularity.
• In the following screenshot of the table, we notice that while height and weight are present, the
body mass index (BMI) is missing. We will demonstrate the calculation of the BMI in Chapter 5,
Computing Foundations – Introduction to Python. Second, Visit #10004 is missing a temperature
reading. This is common in healthcare and may be caused by an oversight in care:

The MORT table
• Finally, we come to the table that contains the target variable. The
MORT table contains just two fields, the patient identifier, and the
date the patient passed away. Patients not listed in this table can be
assumed to be living:

Starting an SQLite session
The database engine we will use to transform our database is SQLite. It should be mentioned that SQL
comes in many variants, and the SQL specific to SQLite has minor differences to that specific to MySQL
or SQL Server databases. However, the underlying principles remain constant across all SQL dialects.
At this time, do the following:
• Navigate to the directory containing the sqlite3.exe program in your shell or command prompt (using
the cd command).
• Type sqlite3 mortality.db and press Enter. You should see a prompt that looks like the following:
sqlite>. This prompt indicates that you are in the SQLite program.
• Throughout the remainder of this chapter, we are going to create some tables and execute some SQLite
commands on them in the SQLite program.
• To exit the session at any time, type .exit and press Enter.

Data engineering, one table at a time with SQL
sqlite> CREATE TABLE PATIENT(Pid VARCHAR(30) NOT NULL,
Fname VARCHAR(30) NOT NULL,Minit CHAR,
Lname VARCHAR(30) NOT NULL,Bdate TEXT NOT NULL,
Street VARCHAR(50),City VARCHAR(30),State VARCHAR(2),
Zip VARCHAR(5),Phone VARCHAR(10) NOT NULL,Sex CHAR,
PRIMARY KEY (Pid)
);

sqlite> INSERT INTO PATIENT (Pid, Fname, Minit, Lname, Bdate, Street, City, State, Zip, Phone, Sex)
VALUES ('1','John','A','Smith','1952-01-01','1206 Fox Hollow
Rd.','Pittsburgh','PA','15213','6789871234','M');
VALUES ('2','Candice','P','Jones','1978-02-03','1429 Orlyn Dr.','Los
Angeles','CA','90024','3107381419','F');
VALUES ('3','Regina','H','Wilson','1985-04-23','765 Chestnut
Ln.','Albany','NY','12065','5184590206','F');
VALUES ('4','Harold','','Lee','1966-11-15','2928 Policy
St.','Providence','RI','02912','6593482691','M');
VALUES ('5','Stan','P','Davis','1958-12-30','4271 12th St.','Atlanta','GA','30339','4049814933','M');

Pid VARCHAR(30) NOT NULL,
Visit_id VARCHAR(30) NOT NULL,
Visit_date DATE NOT NULL,
Attending_md VARCHAR(30) NOT NULL,
Pri_dx_icd VARCHAR(20) NOT NULL,
Pri_dx_name VARCHAR(100) NOT NULL,
Sec_dx_icd VARCHAR(20),
Sec_dx_name VARCHAR(100),
PRIMARY KEY (Visit_id)
);
sqlite> INSERT INTO VISIT (Pid, Visit_id, Visit_date, Attending_md, Pri_dx_icd, Pri_dx_name, Sec_dx_icd,
Sec_dx_name)
VALUES ('1','10001','2016-01-09','JOHNSON','I50.9','Heart failure, unspecified','I10','Essential (primary)
hypertension');
Sec_dx_name)
VALUES ('1','10002','2016-01-17','JOHNSON','I50.9','Heart failure, unspecified','I10','Essential (primary)
hypertension');
Sec_dx_name)
VALUES ('2','10003','2016-01-15','WU','I20.9','Angina pectoris, unspecified','E11.9','Type 2 diabetes mellitus
without complications');
Sec_dx_name)
VALUES ('3','10004','2016-02-01','JOHNSON','I10','Essential (primary) hypertension','','');
Sec_dx_name)
VALUES ('4','10005','2016-02-27','WU','I27.0','Primary pulmonary hypertension','I50.9','Heart failure,
unspecified');
Sec_dx_name)
VALUES ('5','10006','2016-03-01','WU','I50.9','Heart failure, unspecified','E11.9','Type 2 diabetes mellitus

sqlite> CREATE TABLE MEDICATIONS(
Rx_name VARCHAR(50) NOT NULL,
Rx_dose VARCHAR(20),
Rx_freq VARCHAR(10),
Rx_route VARCHAR(10),
Prescribing_md VARCHAR(30) NOT NULL,
Rx_date DATE NOT NULL,
Rx_ndc VARCHAR(30)
);

• sqlite> INSERT INTO MEDICATIONS (Pid, Rx_name, Rx_dose, Rx_freq, Rx_route,
Prescribing_md, Rx_date, Rx_ndc)
VALUES ('1', 'LISINOPRIL','5 mg','bid','po','JOHNSON','01/09/2016','68180-
513-01');
sqlite> INSERT INTO MEDICATIONS (Pid, Rx_name, Rx_dose, Rx_freq, Rx_route,
VALUES ('1', 'FUROSEMIDE','20 mg','bid','po','JOHNSON','01/09/2016','50742-
104-01');
VALUES ('2', 'NITROGLYCERIN','0.4 mg','tid','sl','WU','01/15/2016','59762-
3304-1');
VALUES ('2', 'METFORMIN','500 mg','bid','po','WU','01/15/2016','65162-175-
10');

sqlite> CREATE TABLE LABS(
Lab_name VARCHAR(50),
Lab_abbrev VARCHAR(20),
Lab_loinc VARCHAR(10) NOT NULL,
Lab_value VARCHAR(20) NOT NULL,
Ordering_md VARCHAR(30),
Lab_date DATE NOT NULL
);

• sqlite> INSERT INTO LABS (Pid, Lab_name, Lab_abbrev, Lab_loinc,
Lab_value, Ordering_md, Lab_date)
VALUES ('1','Natriuretic peptide B','BNP','42637-
9','350','JOHNSON','2016-01-09');
sqlite> INSERT INTO LABS (Pid, Lab_name, Lab_abbrev, Lab_loinc,
VALUES ('2','Natriuretic peptide B','BNP','42637-
9','100','WU','2016-01-15');
VALUES ('2','Glucose','GLU','2345-7','225','WU','2016-01-15');
VALUES ('2','Troponin I','TROP','10839-9','<0.004','WU','2016-01-
15');

sqlite> CREATE TABLE VITALS(
Visit_id VARCHAR(30) NOT NULL,
Height_in INT,
Weight_lb FLOAT,
Temp_f FLOAT,
Pulse INT,
Resp_rate INT,
Bp_syst INT,
Bp_diast INT,
SpO2 INT
);

• sqlite> INSERT INTO VITALS (Pid, Visit_id, Height_in, Weight_lb, Temp_f,
Pulse, Resp_rate, Bp_syst, Bp_diast, SpO2)
VALUES ('1','10001',70,188.4,98.6,95,18,154,94,97);
sqlite> INSERT INTO VITALS (Pid, Visit_id, Height_in, Weight_lb, Temp_f,
VALUES ('1','10002',70,188.4,99.1,85,17,157,96,100);
VALUES ('2','10003',63,130.2,98.7,82,16,120,81,100);
VALUES ('3','10004',65,120.0,NULL,100,19,161,100,98);
VALUES ('4','10005',66,211.4,98.2,95,19,143,84,93);
VALUES ('5','10006',69,150.0,97.6,77,18,130,86,99);

sqlite> CREATE TABLE MORT(
Mortality_date DATE NOT NULL,
PRIMARY KEY (Pid)
);
sqlite> INSERT INTO MORT (Pid,
Mortality_date)
VALUES ('1', '2016-05-15');
sqlite> INSERT INTO MORT (Pid,
Mortality_date)

sqlite> SELECT * FROM
PATIENT;

sqlite> CREATE TABLE MORT_FINAL AS
SELECT Visit_id, Pid, Attending_md, Visit_date, Pri_dx_icd,
Sec_dx_icd
FROM VISIT;

�ݺ�ߣ

Health care Analytics-Module 3-CADX150..

Recommended

More Related Content

Similar to Health care Analytics-Module 3-CADX150.. (20)

More from ssuserec53e73 (6)

Recently uploaded (20)

Health care Analytics-Module 3-CADX150..