ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated March 2016
Disclaimer: we are not affiliated with Stata. But we like it. CC BY NC
Data Transformation
with Stata 14.1 Cheat Sheet
For more info see Stata’s reference manual (stata.com)
export delimited "myData.csv", delimiter(",") replace
export data as a comma-delimited file (.csv)
export excel "myData.xls", /*
*/ firstrow(variables) replace
export data as an Excel file (.xls) with the
variable names as the first row
Save & Export Data
save "myData.dta", replace
saveold "myData.dta", replace version(12)
save data in Stata format, replacing the data if
a file with same name exists
Stata 12-compatible file
Manipulate Strings
display trim(" leading / trailing spaces ")
remove extra spaces before and after a string
display regexr("My string", "My", "Your")
replace string1 ("My") with string2 ("Your")
display stritrim(" Too much Space")
replace consecutive spaces with a single space
display strtoname("1Var name")
convert string to Stata-compatible variable name
TRANSFORM STRINGS
display strlower("STATA should not be ALL-CAPS")
change string case; see also strupper, strproper
display strmatch("123.89", "1??.?9")
return true (1) or false (0) if string matches pattern
list make if regexm(make, "[0-9]")
list observations where make matches the regular
expression (here, records that contain a number)
FIND MATCHING STRINGS
GET STRING PROPERTIES
list if regexm(make, "(Cad.|Chev.|Datsun)")
return all observations where make contains
"Cad.", "Chev." or "Datsun"
list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
return all observations where the first word of the
make variable contains the listed words
compare the given list against the first word in make
charlist make
display the set of unique characters within a string
* user-defined package
replace make = subinstr(make, "Cad.", "Cadillac", 1)
replace first occurrence of "Cad." with Cadillac
in the make variable
display length("This string has 29 characters")
return the length of the string
display substr("Stata", 3, 5)
return the string located between characters 3-5
display strpos("Stata", "a")
return the position in Stata where a is first found
display real("100")
convert string to a numeric or missing value
_merge code
row only
in ind2
row only
in hh2
row in
both
1
(master)
2
(using)
3
(match)
Combine Data
ADDING (APPENDING) NEW DATA
MERGING TWO DATASETS TOGETHER
FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID
merge 1:1 id using "ind_age.dta"
one-to-one merge of "ind_age.dta"
into the loaded dataset and create
variable "_merge" to track the origin
webuse ind_age.dta, clear
save ind_age.dta, replace
webuse ind_ag.dta, clear
merge m:1 hid using "hh2.dta"
many-to-one merge of "hh2.dta"
into the loaded dataset and create
variable "_merge" to track the origin
webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
append using "coffeeMaize2.dta", gen(filenum)
add observations from "coffeeMaize2.dta" to
current data and create variable "filenum" to
track the origin of each observation
webuse coffeeMaize2.dta, clear
save coffeeMaize2.dta, replace
webuse coffeeMaize.dta, clear
load demo dataid blue pink
+
id blue pink
id blue pink
should
contain
the same
variables
(columns)
MANY-TO-ONE
id blue pink id brown blue pink brown _merge
3
3
1
3
2
1
3
. .
.
.
id
+ =
ONE-TO-ONE
id blue pink id brown blue pink brownid _merge
3
3
3
+ =
must contain a
common variable
(id)
match records from different data sets using probabilistic matchingreclink
create distance measure for similarity between two strings
ssc install reclink
ssc install jarowinklerjarowinkler
Reshape Data
webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
webuse "coffeeMaize.dta" load demo dataset
xpose, clear varname
transpose rows and columns of data, clearing the data and saving
old column names as a new variable called "_varname"
MELT DATA (WIDE → LONG)
reshape long coffee@ maize@, i(country) j(year)
convert a wide dataset to long
reshape variables starting
with coffee and maize
unique id
variable (key)
create new variable which captures
the info in the column names
CAST DATA (LONG → WIDE)
reshape wide coffee maize, i(country) j(year)
convert a long dataset to wide
create new variables named
coffee2011, maize2012...
what will be
unique id
variable (key)
create new variables
with the year added
to the column name
When datasets are
tidy, they have a
c o n s i s t e n t ,
standard format
that is easier to
manipulate and
analyze.
country
coffee
2011
coffee
2012
maize
2011
maize
2012
Malawi
Rwanda
Uganda cast
melt
Rwanda
Uganda
Malawi
Malawi
Rwanda
Uganda 2012
2011
2011
2012
2011
2012
year coffee maizecountry
WIDE LONG (TIDY) TIDY DATASETS have
each observation
in its own row and
each variable in its
own column.
new variable
Label Data
label list
list all labels within the dataset
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
define a label and apply it the values in foreign
Value labels map string descriptions to numers. They allow the
underlying data to be numeric (making logical tests simpler)
while also connecting the values to human-understandable text.
Replace Parts of Data
rename (rep78 foreign) (repairRecord carType)
rename one or multiple variables
CHANGE COLUMN NAMES
recode price (0 / 5000 = 5000)
change all prices less than 5000 to be $5,000
recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2)
change the values and value labels then store in a new
variable, foreign2
CHANGE ROW VALUES
useful for exporting datamvencode _all, mv(9999)
replace missing values with the number 9999 for all variables
mvdecode _all, mv(9999)
replace the number 9999 with missing value in all variables
useful for cleaning survey datasets
REPLACE MISSING VALUES
replace price = 5000 if price < 5000
replace all values of price that are less than $5,000 with 5000
Select Parts of Data (Subsetting)
FILTER SPECIFIC ROWS
drop in 1/4drop if mpg < 20
drop observations based on a condition (left)
or rows 1-4 (right)
keep in 1/30
opposite of drop; keep only rows 1-30
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
keep the specified values of make
keep if inrange(price, 5000, 10000)
keep values of price between $5,000 – $10,000 (inclusive)
sample 25
sample 25% of the observations in the dataset
(use set seed # command for reproducible sampling)
SELECT SPECIFIC COLUMNS
drop make
remove the 'make' variable
keep make price
opposite of drop; keep only columns 'make' and 'price'

More Related Content

What's hot (19)

3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
Sakthi Dasans
Ìý
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
Abhik Seal
Ìý
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
Parth Khare
Ìý
Data handling in r
Data handling in rData handling in r
Data handling in r
Abhik Seal
Ìý
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
Sakthi Dasans
Ìý
R language introduction
R language introductionR language introduction
R language introduction
Shashwat Shriparv
Ìý
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
Avjinder (Avi) Kaler
Ìý
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
Romain Francois
Ìý
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data frame
krishna singh
Ìý
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
Florian Uhlitz
Ìý
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
Khaled Al-Shamaa
Ìý
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
Rsquared Academy
Ìý
R gráfico
R gráficoR gráfico
R gráfico
stryper1968
Ìý
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
Ìý
Mysql
MysqlMysql
Mysql
merlin deepika
Ìý
Mysql
MysqlMysql
Mysql
Yuvaraja Rajenderan
Ìý
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
Dr. Volkan OBAN
Ìý
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
Jeffrey Breen
Ìý
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
Rsquared Academy
Ìý
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
Sakthi Dasans
Ìý
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
Abhik Seal
Ìý
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
Parth Khare
Ìý
Data handling in r
Data handling in rData handling in r
Data handling in r
Abhik Seal
Ìý
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
Sakthi Dasans
Ìý
R language introduction
R language introductionR language introduction
R language introduction
Shashwat Shriparv
Ìý
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
Avjinder (Avi) Kaler
Ìý
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
Romain Francois
Ìý
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data frame
krishna singh
Ìý
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
Florian Uhlitz
Ìý
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
Khaled Al-Shamaa
Ìý
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
Rsquared Academy
Ìý
R gráfico
R gráficoR gráfico
R gráfico
stryper1968
Ìý
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde
Ìý
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
Dr. Volkan OBAN
Ìý
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
Jeffrey Breen
Ìý
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
Rsquared Academy
Ìý

Similar to Stata cheat sheet: data transformation (20)

Cheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF CompleteCheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF Complete
TsamaraLuthfia1
Ìý
Data Management in R
Data Management in RData Management in R
Data Management in R
Sankhya_Analytics
Ìý
Chapter 4 Structured Query Language
Chapter 4 Structured Query LanguageChapter 4 Structured Query Language
Chapter 4 Structured Query Language
Eddyzulham Mahluzydde
Ìý
INTRODUCTION TO STATA.pptx
INTRODUCTION TO STATA.pptxINTRODUCTION TO STATA.pptx
INTRODUCTION TO STATA.pptx
Dhananjaykumar464035
Ìý
Python Pandas
Python PandasPython Pandas
Python Pandas
Sunil OS
Ìý
Db1 lecture4
Db1 lecture4Db1 lecture4
Db1 lecture4
Sherif Gad
Ìý
Transpose and manipulate character strings
Transpose and manipulate character strings Transpose and manipulate character strings
Transpose and manipulate character strings
Rupak Roy
Ìý
Mysql
MysqlMysql
Mysql
HAINIRMALRAJ
Ìý
Stata cheatsheet programming
Stata cheatsheet programmingStata cheatsheet programming
Stata cheatsheet programming
Tim Essam
Ìý
Sas cheat
Sas cheatSas cheat
Sas cheat
imaduddin91
Ìý
BP208 Fabulous Feats with @Formula
BP208 Fabulous Feats with @FormulaBP208 Fabulous Feats with @Formula
BP208 Fabulous Feats with @Formula
Kathy Brown
Ìý
Disconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NETDisconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NET
Everywhere
Ìý
Mysql1
Mysql1Mysql1
Mysql1
rajikaa
Ìý
Introduction to MySQL - Part 2
Introduction to MySQL - Part 2Introduction to MySQL - Part 2
Introduction to MySQL - Part 2
webhostingguy
Ìý
6.1\9 SSIS 2008R2_Training - DataFlow Transformations
6.1\9 SSIS 2008R2_Training - DataFlow Transformations6.1\9 SSIS 2008R2_Training - DataFlow Transformations
6.1\9 SSIS 2008R2_Training - DataFlow Transformations
Pramod Singla
Ìý
Pandas cheat sheet_data science
Pandas cheat sheet_data sciencePandas cheat sheet_data science
Pandas cheat sheet_data science
Subrata Shaw
Ìý
Pandas Cheat Sheet
Pandas Cheat SheetPandas Cheat Sheet
Pandas Cheat Sheet
ACASH1011
Ìý
Pandas cheat sheet
Pandas cheat sheetPandas cheat sheet
Pandas cheat sheet
Lenis Carolina Lopez
Ìý
19 tables
19 tables19 tables
19 tables
Hadley Wickham
Ìý
Data Wrangling with Pandas
Data Wrangling with PandasData Wrangling with Pandas
Data Wrangling with Pandas
Luis Carrasco
Ìý
Cheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF CompleteCheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF Complete
TsamaraLuthfia1
Ìý
Chapter 4 Structured Query Language
Chapter 4 Structured Query LanguageChapter 4 Structured Query Language
Chapter 4 Structured Query Language
Eddyzulham Mahluzydde
Ìý
Python Pandas
Python PandasPython Pandas
Python Pandas
Sunil OS
Ìý
Db1 lecture4
Db1 lecture4Db1 lecture4
Db1 lecture4
Sherif Gad
Ìý
Transpose and manipulate character strings
Transpose and manipulate character strings Transpose and manipulate character strings
Transpose and manipulate character strings
Rupak Roy
Ìý
Stata cheatsheet programming
Stata cheatsheet programmingStata cheatsheet programming
Stata cheatsheet programming
Tim Essam
Ìý
BP208 Fabulous Feats with @Formula
BP208 Fabulous Feats with @FormulaBP208 Fabulous Feats with @Formula
BP208 Fabulous Feats with @Formula
Kathy Brown
Ìý
Disconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NETDisconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NET
Everywhere
Ìý
Mysql1
Mysql1Mysql1
Mysql1
rajikaa
Ìý
Introduction to MySQL - Part 2
Introduction to MySQL - Part 2Introduction to MySQL - Part 2
Introduction to MySQL - Part 2
webhostingguy
Ìý
6.1\9 SSIS 2008R2_Training - DataFlow Transformations
6.1\9 SSIS 2008R2_Training - DataFlow Transformations6.1\9 SSIS 2008R2_Training - DataFlow Transformations
6.1\9 SSIS 2008R2_Training - DataFlow Transformations
Pramod Singla
Ìý
Pandas cheat sheet_data science
Pandas cheat sheet_data sciencePandas cheat sheet_data science
Pandas cheat sheet_data science
Subrata Shaw
Ìý
Pandas Cheat Sheet
Pandas Cheat SheetPandas Cheat Sheet
Pandas Cheat Sheet
ACASH1011
Ìý
Data Wrangling with Pandas
Data Wrangling with PandasData Wrangling with Pandas
Data Wrangling with Pandas
Luis Carrasco
Ìý

Recently uploaded (20)

Presentation1.pptx for data and table analysis
Presentation1.pptx for data and table analysisPresentation1.pptx for data and table analysis
Presentation1.pptx for data and table analysis
vatsalsingla4
Ìý
Updated Willow 2025 Media Deck_Updated010325.pdf
Updated Willow 2025 Media Deck_Updated010325.pdfUpdated Willow 2025 Media Deck_Updated010325.pdf
Updated Willow 2025 Media Deck_Updated010325.pdf
tangramcommunication
Ìý
Optimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with CalciteOptimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with Calcite
Stamatis Zampetakis
Ìý
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptx
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptxIntroduction to Java Programming for High School by ºÝºÝߣsgo.pptx
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptx
mirhuzaifahali
Ìý
Lesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptxLesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptx
1045858
Ìý
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo GuruThe Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
kenyoncenteno12
Ìý
Cost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheetCost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheet
supreetk82004
Ìý
CloudMonitor - Architecture Audit Review February 2025.pdf
CloudMonitor - Architecture Audit Review February 2025.pdfCloudMonitor - Architecture Audit Review February 2025.pdf
CloudMonitor - Architecture Audit Review February 2025.pdf
Rodney Joyce
Ìý
100680-05-Eucharist_Orientation_Sessions.pdf
100680-05-Eucharist_Orientation_Sessions.pdf100680-05-Eucharist_Orientation_Sessions.pdf
100680-05-Eucharist_Orientation_Sessions.pdf
jacobdivina9
Ìý
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptxvnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
deomom129
Ìý
The Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability AnalyticsThe Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability Analytics
christophercamposus1
Ìý
data mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmglegdata mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmgleg
1052LaxmanrajS
Ìý
Stasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesiaStasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesia
fikrimanurung1
Ìý
Valkey 101 - SCaLE 22x March 2025 Stokes.pdf
Valkey 101 - SCaLE 22x March 2025 Stokes.pdfValkey 101 - SCaLE 22x March 2025 Stokes.pdf
Valkey 101 - SCaLE 22x March 2025 Stokes.pdf
Dave Stokes
Ìý
Kaggle & Datathons: A Practical Guide to AI Competitions
Kaggle & Datathons: A Practical Guide to AI CompetitionsKaggle & Datathons: A Practical Guide to AI Competitions
Kaggle & Datathons: A Practical Guide to AI Competitions
rasheedsrq
Ìý
Introduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdfIntroduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdf
messagetome133
Ìý
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
rossanthonytan130
Ìý
iam free indeed.pptxiam free indeed.pptx
iam free indeed.pptxiam free indeed.pptxiam free indeed.pptxiam free indeed.pptx
iam free indeed.pptxiam free indeed.pptx
muhweziart
Ìý
Lecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptxLecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptx
humairafatima22
Ìý
19th Edition Of International Research Data Analysis Excellence Awards
19th Edition Of International Research Data Analysis Excellence Awards19th Edition Of International Research Data Analysis Excellence Awards
19th Edition Of International Research Data Analysis Excellence Awards
dataanalysisconferen
Ìý
Presentation1.pptx for data and table analysis
Presentation1.pptx for data and table analysisPresentation1.pptx for data and table analysis
Presentation1.pptx for data and table analysis
vatsalsingla4
Ìý
Updated Willow 2025 Media Deck_Updated010325.pdf
Updated Willow 2025 Media Deck_Updated010325.pdfUpdated Willow 2025 Media Deck_Updated010325.pdf
Updated Willow 2025 Media Deck_Updated010325.pdf
tangramcommunication
Ìý
Optimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with CalciteOptimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with Calcite
Stamatis Zampetakis
Ìý
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptx
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptxIntroduction to Java Programming for High School by ºÝºÝߣsgo.pptx
Introduction to Java Programming for High School by ºÝºÝߣsgo.pptx
mirhuzaifahali
Ìý
Lesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptxLesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptx
1045858
Ìý
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo GuruThe Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
The Marketability of Rice Straw Yarn Among Selected Customers of Gantsilyo Guru
kenyoncenteno12
Ìý
Cost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheetCost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheet
supreetk82004
Ìý
CloudMonitor - Architecture Audit Review February 2025.pdf
CloudMonitor - Architecture Audit Review February 2025.pdfCloudMonitor - Architecture Audit Review February 2025.pdf
CloudMonitor - Architecture Audit Review February 2025.pdf
Rodney Joyce
Ìý
100680-05-Eucharist_Orientation_Sessions.pdf
100680-05-Eucharist_Orientation_Sessions.pdf100680-05-Eucharist_Orientation_Sessions.pdf
100680-05-Eucharist_Orientation_Sessions.pdf
jacobdivina9
Ìý
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptxvnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
deomom129
Ìý
The Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability AnalyticsThe Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability Analytics
christophercamposus1
Ìý
data mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmglegdata mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmgleg
1052LaxmanrajS
Ìý
Stasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesiaStasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesia
fikrimanurung1
Ìý
Valkey 101 - SCaLE 22x March 2025 Stokes.pdf
Valkey 101 - SCaLE 22x March 2025 Stokes.pdfValkey 101 - SCaLE 22x March 2025 Stokes.pdf
Valkey 101 - SCaLE 22x March 2025 Stokes.pdf
Dave Stokes
Ìý
Kaggle & Datathons: A Practical Guide to AI Competitions
Kaggle & Datathons: A Practical Guide to AI CompetitionsKaggle & Datathons: A Practical Guide to AI Competitions
Kaggle & Datathons: A Practical Guide to AI Competitions
rasheedsrq
Ìý
Introduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdfIntroduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdf
messagetome133
Ìý
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
rossanthonytan130
Ìý
iam free indeed.pptxiam free indeed.pptx
iam free indeed.pptxiam free indeed.pptxiam free indeed.pptxiam free indeed.pptx
iam free indeed.pptxiam free indeed.pptx
muhweziart
Ìý
Lecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptxLecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptx
humairafatima22
Ìý
19th Edition Of International Research Data Analysis Excellence Awards
19th Edition Of International Research Data Analysis Excellence Awards19th Edition Of International Research Data Analysis Excellence Awards
19th Edition Of International Research Data Analysis Excellence Awards
dataanalysisconferen
Ìý

Stata cheat sheet: data transformation

  • 1. Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) geocenter.github.io/StataTraining updated March 2016 Disclaimer: we are not affiliated with Stata. But we like it. CC BY NC Data Transformation with Stata 14.1 Cheat Sheet For more info see Stata’s reference manual (stata.com) export delimited "myData.csv", delimiter(",") replace export data as a comma-delimited file (.csv) export excel "myData.xls", /* */ firstrow(variables) replace export data as an Excel file (.xls) with the variable names as the first row Save & Export Data save "myData.dta", replace saveold "myData.dta", replace version(12) save data in Stata format, replacing the data if a file with same name exists Stata 12-compatible file Manipulate Strings display trim(" leading / trailing spaces ") remove extra spaces before and after a string display regexr("My string", "My", "Your") replace string1 ("My") with string2 ("Your") display stritrim(" Too much Space") replace consecutive spaces with a single space display strtoname("1Var name") convert string to Stata-compatible variable name TRANSFORM STRINGS display strlower("STATA should not be ALL-CAPS") change string case; see also strupper, strproper display strmatch("123.89", "1??.?9") return true (1) or false (0) if string matches pattern list make if regexm(make, "[0-9]") list observations where make matches the regular expression (here, records that contain a number) FIND MATCHING STRINGS GET STRING PROPERTIES list if regexm(make, "(Cad.|Chev.|Datsun)") return all observations where make contains "Cad.", "Chev." or "Datsun" list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun") return all observations where the first word of the make variable contains the listed words compare the given list against the first word in make charlist make display the set of unique characters within a string * user-defined package replace make = subinstr(make, "Cad.", "Cadillac", 1) replace first occurrence of "Cad." with Cadillac in the make variable display length("This string has 29 characters") return the length of the string display substr("Stata", 3, 5) return the string located between characters 3-5 display strpos("Stata", "a") return the position in Stata where a is first found display real("100") convert string to a numeric or missing value _merge code row only in ind2 row only in hh2 row in both 1 (master) 2 (using) 3 (match) Combine Data ADDING (APPENDING) NEW DATA MERGING TWO DATASETS TOGETHER FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID merge 1:1 id using "ind_age.dta" one-to-one merge of "ind_age.dta" into the loaded dataset and create variable "_merge" to track the origin webuse ind_age.dta, clear save ind_age.dta, replace webuse ind_ag.dta, clear merge m:1 hid using "hh2.dta" many-to-one merge of "hh2.dta" into the loaded dataset and create variable "_merge" to track the origin webuse hh2.dta, clear save hh2.dta, replace webuse ind2.dta, clear append using "coffeeMaize2.dta", gen(filenum) add observations from "coffeeMaize2.dta" to current data and create variable "filenum" to track the origin of each observation webuse coffeeMaize2.dta, clear save coffeeMaize2.dta, replace webuse coffeeMaize.dta, clear load demo dataid blue pink + id blue pink id blue pink should contain the same variables (columns) MANY-TO-ONE id blue pink id brown blue pink brown _merge 3 3 1 3 2 1 3 . . . . id + = ONE-TO-ONE id blue pink id brown blue pink brownid _merge 3 3 3 + = must contain a common variable (id) match records from different data sets using probabilistic matchingreclink create distance measure for similarity between two strings ssc install reclink ssc install jarowinklerjarowinkler Reshape Data webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data webuse "coffeeMaize.dta" load demo dataset xpose, clear varname transpose rows and columns of data, clearing the data and saving old column names as a new variable called "_varname" MELT DATA (WIDE → LONG) reshape long coffee@ maize@, i(country) j(year) convert a wide dataset to long reshape variables starting with coffee and maize unique id variable (key) create new variable which captures the info in the column names CAST DATA (LONG → WIDE) reshape wide coffee maize, i(country) j(year) convert a long dataset to wide create new variables named coffee2011, maize2012... what will be unique id variable (key) create new variables with the year added to the column name When datasets are tidy, they have a c o n s i s t e n t , standard format that is easier to manipulate and analyze. country coffee 2011 coffee 2012 maize 2011 maize 2012 Malawi Rwanda Uganda cast melt Rwanda Uganda Malawi Malawi Rwanda Uganda 2012 2011 2011 2012 2011 2012 year coffee maizecountry WIDE LONG (TIDY) TIDY DATASETS have each observation in its own row and each variable in its own column. new variable Label Data label list list all labels within the dataset label define myLabel 0 "US" 1 "Not US" label values foreign myLabel define a label and apply it the values in foreign Value labels map string descriptions to numers. They allow the underlying data to be numeric (making logical tests simpler) while also connecting the values to human-understandable text. Replace Parts of Data rename (rep78 foreign) (repairRecord carType) rename one or multiple variables CHANGE COLUMN NAMES recode price (0 / 5000 = 5000) change all prices less than 5000 to be $5,000 recode foreign (0 = 2 "US")(1 = 1 "Not US"), gen(foreign2) change the values and value labels then store in a new variable, foreign2 CHANGE ROW VALUES useful for exporting datamvencode _all, mv(9999) replace missing values with the number 9999 for all variables mvdecode _all, mv(9999) replace the number 9999 with missing value in all variables useful for cleaning survey datasets REPLACE MISSING VALUES replace price = 5000 if price < 5000 replace all values of price that are less than $5,000 with 5000 Select Parts of Data (Subsetting) FILTER SPECIFIC ROWS drop in 1/4drop if mpg < 20 drop observations based on a condition (left) or rows 1-4 (right) keep in 1/30 opposite of drop; keep only rows 1-30 keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru") keep the specified values of make keep if inrange(price, 5000, 10000) keep values of price between $5,000 – $10,000 (inclusive) sample 25 sample 25% of the observations in the dataset (use set seed # command for reproducible sampling) SELECT SPECIFIC COLUMNS drop make remove the 'make' variable keep make price opposite of drop; keep only columns 'make' and 'price'