Desk reference for data transformation in Stata. Co-authored with Tim Essam (@StataRGIS, linkedin.com/in/timessam). See all cheat sheets at http://bit.ly/statacheatsheets. Updated 2016/06/03.
This document is a cheat sheet of frequently used Stata commands for data transformation: saving and exporting data, manipulating strings, combining and reshaping datasets, labeling data, replacing values, and subsetting observations and variables.
Stata cheat sheet: data transformation
Tim Essam (tessam@usaid.gov) • Laura Hughes (lhughes@usaid.gov) • inspired by RStudio’s awesome Cheat Sheets (rstudio.com/resources/cheatsheets) • geocenter.github.io/StataTraining • updated March 2016
Disclaimer: we are not affiliated with Stata. But we like it. CC BY NC
Data Transformation with Stata 14.1 Cheat Sheet
For more info see Stata’s reference manual (stata.com)
Save & Export Data
save "myData.dta", replace
    save data in Stata format, replacing the file if one with the same name exists
saveold "myData.dta", replace version(12)
    save data as a Stata 12-compatible file
export delimited "myData.csv", delimiter(",") replace
    export data as a comma-delimited file (.csv)
export excel "myData.xls", firstrow(variables) replace
    export data as an Excel file (.xls) with the variable names as the first row
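A minimal end-to-end sketch (not from the original sheet) using Stata's built-in auto demo dataset; the file names are illustrative:

sysuse auto, clear                                      // load built-in demo data
save "myData.dta", replace                              // native Stata format
export delimited "myData.csv", delimiter(",") replace   // comma-delimited copy
export excel "myData.xls", firstrow(variables) replace  // Excel copy with headers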
Manipulate Strings
TRANSFORM STRINGS
display trim(" leading / trailing spaces ")
    remove extra spaces before and after a string
display stritrim(" Too   much   Space")
    replace consecutive spaces with a single space
display regexr("My string", "My", "Your")
    replace string1 ("My") with string2 ("Your")
replace make = subinstr(make, "Cad.", "Cadillac", 1)
    replace the first occurrence of "Cad." with "Cadillac" in the make variable
display strlower("STATA should not be ALL-CAPS")
    change string case; see also strupper, strproper
display strtoname("1Var name")
    convert a string to a Stata-compatible variable name

FIND MATCHING STRINGS
display strmatch("123.89", "1??.?9")
    return true (1) or false (0) if the string matches the pattern
list make if regexm(make, "[0-9]")
    list observations where make matches the regular expression
    (here, records that contain a number)
list if regexm(make, "(Cad.|Chev.|Datsun)")
    return all observations where make contains "Cad.", "Chev." or "Datsun"
list if inlist(word(make, 1), "Cad.", "Chev.", "Datsun")
    return all observations where the first word of the make variable is one
    of the listed words (compares the given list against the first word in make)
charlist make
    display the set of unique characters within a string
    (user-defined package: install with ssc install charlist)

GET STRING PROPERTIES
display length("This string has 29 characters")
    return the length of the string
display substr("Stata", 3, 5)
    return the substring starting at character 3 and up to 5 characters long
    (here "ata"); the third argument is a length, not an end position
display strpos("Stata", "a")
    return the position in "Stata" where "a" is first found
display real("100")
    convert a string to a numeric value (or missing if it cannot be converted)
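To show several of these functions working together, here is a small sketch (not from the original sheet; the new variable names are illustrative) on the built-in auto dataset:

sysuse auto, clear
generate make_clean = stritrim(trim(make))     // normalize internal and outer spacing
generate make_id = strtoname(make_clean)       // legal variable-name form of the string
generate has_digit = regexm(make, "[0-9]")     // 1 if make contains a digit, else 0
list make make_id has_digit in 1/5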
Combine Data

MERGING TWO DATASETS TOGETHER
The datasets being merged must contain a common variable (here, an id) to
match records on.
webuse ind_age.dta, clear
save ind_age.dta, replace
webuse ind_ag.dta, clear
merge 1:1 id using "ind_age.dta"
    one-to-one merge of "ind_age.dta" into the loaded dataset, creating the
    variable "_merge" to track the origin of each observation
webuse hh2.dta, clear
save hh2.dta, replace
webuse ind2.dta, clear
merge m:1 hid using "hh2.dta"
    many-to-one merge of "hh2.dta" into the loaded dataset, creating the
    variable "_merge" to track the origin of each observation
_merge codes: 1 (master) = row only in the master dataset (here, ind2);
2 (using) = row only in the using dataset (here, hh2); 3 (match) = row in both.
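A common follow-up pattern after a merge (a minimal sketch, not part of the original sheet):

tabulate _merge        // inspect how many rows matched
keep if _merge == 3    // keep only rows found in both datasets
drop _merge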
ADDING (APPENDING) NEW DATA
The datasets being appended should contain the same variables (columns).
webuse coffeeMaize2.dta, clear
save coffeeMaize2.dta, replace
webuse coffeeMaize.dta, clear
    load demo data
append using "coffeeMaize2.dta", gen(filenum)
    add observations from "coffeeMaize2.dta" to the current data and create
    variable "filenum" to track the origin of each observation
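A small follow-up sketch (an assumption, not from the sheet): with gen(filenum), rows from the data in memory get filenum = 0 and rows from the appended file get filenum = 1.

tabulate filenum                                   // count rows by source file
label define src 0 "coffeeMaize" 1 "coffeeMaize2"  // illustrative label names
label values filenum src                           // make the origin human-readable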
(Diagrams in the original sheet: appending stacks two tables that share the
same columns; one-to-one and many-to-one merges match rows on the common id
variable, filling unmatched cells with missing values while _merge records
each row's origin.)
FUZZY MATCHING: COMBINING TWO DATASETS WITHOUT A COMMON ID
reclink
    match records from different datasets using probabilistic matching
    (user-defined package: install with ssc install reclink)
jarowinkler
    create a distance measure for the similarity between two strings
    (user-defined package: install with ssc install jarowinkler)
Reshape Data
webuse set https://github.com/GeoCenter/StataTraining/raw/master/Day2/Data
webuse "coffeeMaize.dta"
    load demo dataset
xpose, clear varname
    transpose rows and columns of data, clearing the data and saving the
    old column names as a new variable called "_varname"
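A minimal sketch of xpose on a tiny hand-entered dataset (not from the original sheet); note that xpose keeps only numeric values:

clear
input a b c
1 2 3
4 5 6
end
xpose, clear varname    // the two rows become columns v1, v2; _varname holds a, b, c
list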
MELT DATA (WIDE → LONG)
reshape long coffee@ maize@, i(country) j(year)
    convert a wide dataset to long: reshape the variables starting with
    coffee and maize, treating country as the unique id variable (key) and
    creating a new variable (year) that captures the info in the column names
CAST DATA (LONG → WIDE)
reshape wide coffee maize, i(country) j(year)
    convert a long dataset to wide: country will be the unique id variable
    (key), and new variables are created with the year added to the column
    name (coffee2011, maize2012, ...)
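A round-trip sketch on the demo data (assuming the webuse set path above is still reachable and coffeeMaize.dta is in wide format):

webuse "coffeeMaize.dta", clear
reshape long coffee@ maize@, i(country) j(year)   // wide -> long (tidy)
list in 1/3                                       // one row per country-year
reshape wide coffee maize, i(country) j(year)     // long -> back to wide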
TIDY DATASETS have each observation in its own row and each variable in its
own column. When datasets are tidy, they have a consistent, standard format
that is easier to manipulate and analyze.
(Diagram in the original sheet: a WIDE table for Malawi, Rwanda, and Uganda
with columns coffee2011, coffee2012, maize2011, maize2012 melts into a LONG,
tidy table with columns country, year, coffee, maize; casting reverses the
operation, with year as the new variable created by the melt.)
Label Data
Value labels map string descriptions to numbers. They allow the underlying
data to be numeric (making logical tests simpler) while also connecting the
values to human-understandable text.
label list
    list all labels within the dataset
label define myLabel 0 "US" 1 "Not US"
label values foreign myLabel
    define a label and apply it to the values in foreign
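A short sketch of the payoff (not from the original sheet; uses the built-in auto dataset, with the replace option added so the example can be re-run):

sysuse auto, clear
label define myLabel 0 "US" 1 "Not US", replace
label values foreign myLabel
tabulate foreign         // output shows "US" / "Not US"
count if foreign == 0    // logical tests still use the numeric codes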
Replace Parts of Data
CHANGE COLUMN NAMES
rename (rep78 foreign) (repairRecord carType)
    rename one or multiple variables
CHANGE ROW VALUES
replace price = 5000 if price < 5000
    replace all values of price that are less than $5,000 with 5000
recode price (0/5000 = 5000)
    change all prices less than 5000 to be $5,000
recode foreign (0 = 2 "US") (1 = 1 "Not US"), gen(foreign2)
    change the values and value labels, then store the result in a new
    variable, foreign2

REPLACE MISSING VALUES
mvencode _all, mv(9999)
    replace missing values with the number 9999 in all variables
    (useful for exporting data)
mvdecode _all, mv(9999)
    replace the number 9999 with a missing value in all variables
    (useful for cleaning survey datasets)
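A quick way to verify a recode before trusting it (a sketch, not from the original sheet):

sysuse auto, clear
recode foreign (0 = 2 "US") (1 = 1 "Not US"), gen(foreign2)
tabulate foreign foreign2, missing   // cross-check old codes against new ones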
Select Parts of Data (Subsetting)
FILTER SPECIFIC ROWS
drop if mpg < 20
drop in 1/4
    drop observations based on a condition (first) or rows 1-4 (second)
keep in 1/30
opposite of drop; keep only rows 1-30
keep if inlist(make, "Honda Accord", "Honda Civic", "Subaru")
keep the specified values of make
keep if inrange(price, 5000, 10000)
keep values of price between $5,000 – $10,000 (inclusive)
sample 25
sample 25% of the observations in the dataset
(use set seed # command for reproducible sampling)
SELECT SPECIFIC COLUMNS
drop make
remove the 'make' variable
keep make price
opposite of drop; keep only columns 'make' and 'price'
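When a subset is only needed temporarily, preserve/restore avoids reloading the data (a sketch, not from the original sheet):

sysuse auto, clear
preserve                              // snapshot the full dataset
keep if inrange(price, 5000, 10000)   // keep mid-priced cars
keep make price                       // keep only two columns
list in 1/5
restore                               // the full dataset is back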