[Stata] Data cleaning 7: Working with string variables (destring, tostring, encode, and decode) – Nari's Research Log

2023-12-13 01:25

Data preparation is often said to occupy 80% of the data analysis process. Ensuring that your data is clean, accurate, and in the right format is crucial before performing any statistical analysis. For those using Stata, managing and cleaning string variables (text data) can initially seem challenging, but with several commands, it becomes a smooth process.

This blog post will delve into four Stata commands that will simplify the way you handle text data: destring, tostring, encode, and decode. Whether you’re a Stata newbie or need a quick refresher, this guide is here to assist you.

What are string variables?

String variables are essentially sequences of characters. They can contain anything from letters, numbers, and spaces to other special characters. In Stata, string variables are easy to identify when you open the data browser or run the codebook command. Here, we use Stata's example dataset hbp2.dta.

    webuse hbp2.dta

In the data browser, string variables are shown in a wine (dark red) color, while numeric variables are shown in black.

When you run the command codebook, it will also show whether the variable is string or numeric.

    codebook varname

The number after str in the type is the maximum number of characters that the variable can hold (e.g., str6, str12, or str50).
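As a quick check outside the data browser, the sketch below lists storage types from the command line. It assumes the hbp2 example data loaded above, in which id and sex are stored as strings; describe and ds are standard Stata commands.

```stata
webuse hbp2, clear        // load the example data (needs internet access)
describe id sex           // the storage type column shows str# for string variables
ds, has(type string)      // list only the string variables in memory
```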

Why can’t I directly analyze my string data?

While strings are flexible and can store a variety of data, Stata and many other statistical packages expect numeric data for analysis. If you try to run a command such as a regression on string variables, Stata returns an error message. Hence, converting or managing string variables becomes important.
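A minimal illustration, assuming the hbp2 data and its numeric outcome variable hbp (a name taken from the example dataset, not from the text above). Prefixing the command with capture lets a do-file continue and stores the return code in _rc.

```stata
webuse hbp2, clear
capture noisily regress hbp sex   // sex is a string variable, so the command fails
display _rc                        // a nonzero return code means regress did not run
```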

The destring Command

The destring command converts string variables that represent numbers into numeric variables. This is particularly useful when data is imported and numbers are mistakenly read as strings, even though every value in the variable is a number.

For example, here, id contains only numbers, so it is better to convert it to a numeric variable, like schno.

    destring varname, replace                // replace the string variable with a numeric variable
    destring varname, generate(newvarname)   // keep the original and generate newvarname separately

If the conversion succeeds, Stata prints a note confirming that all characters were numeric and that the variable was replaced (or that the new variable was generated).

Tip: the “contains nonnumeric characters” message

If there are characters within your string data that can’t be converted into numbers, Stata leaves the variable unchanged and reports that it contains nonnumeric characters.

Solution 1: Use the ignore() option. For example, if the text is formatted like “$10,000”, run destring income, replace ignore("$,").

Solution 2: Use the force option. It converts any value that still contains non-numeric characters into a missing value.

    destring varname, replace force         // values with non-numeric characters become missing
    destring varname, replace ignore(" ")   // remove spaces, then convert; you can list other characters in ignore()

You can also confirm that the variable is now numeric by running codebook.
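The two options can be combined. The sketch below uses a made-up three-row dataset (the variable names income and income_num are hypothetical): ignore() strips commas and spaces, and force turns whatever still cannot be read as a number into missing.

```stata
clear
input str10 income
"10,000"
"2 500"
"n/a"
end

// Strip commas and spaces; anything still non-numeric ("n/a") becomes missing
destring income, generate(income_num) ignore(", ") force
list
```

Using generate() rather than replace keeps the original text, which makes it easy to see which values force turned into missing.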

The tostring Command

Converting numerical data to string format: Why and when? Sometimes, numbers are better treated as string values (like phone numbers or zip codes).

    tostring varname, replace

tostring is also useful when you want to concatenate numbers. For example, to build a unique id from statefips (e.g., 10) and countyfips (e.g., 002), convert both to strings and then combine them with generate to get 10002. If you skip tostring, generate returns the sum of the statefips and countyfips values (10 + 2 = 12) instead of 10002, because + adds numeric variables but concatenates string variables. Note that a county code stored as the number 2 becomes “2”, not “002”, unless you give tostring a format such as format(%03.0f) that pads it with leading zeros.

    tostring statefips, replace
    tostring countyfips, replace format(%03.0f)   // pad to three digits so that 2 becomes "002"
    gen fips_id = statefips + countyfips

Here is another example of creating an id with tostring when the values have different numbers of characters.

https://twitter.com/toddrjones/status/1699897399416955350?s=20
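An alternative sketch that leaves the numeric variables untouched: the string() function converts a number using a display format, so leading zeros can be added while the id is built. The two-row dataset is made up for illustration, and the two-digit state and three-digit county widths are assumptions about the coding scheme.

```stata
clear
input statefips countyfips
10 2
6 37
end

// Pad state to 2 digits and county to 3 digits, then concatenate the strings
generate fips_id = string(statefips, "%02.0f") + string(countyfips, "%03.0f")
list
```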

The encode Command: convert string to categorical variable

Mapping text to numbers is the idea behind encoding. When we have categorical data like “Low”, “Medium”, and “High”, it is useful to convert them into numbers like 1, 2, 3 for easier analysis. The encode command has no replace option, so you always create a new variable with generate(). Here, the variable sex is better treated as a numeric (categorical) variable.

    encode varname, generate(newvarname)

After running encode, the new variable (sex2 here) is numeric, with automatically assigned values and value labels taken from the text of the original variable. You can check the result in the data browser by placing the new variable next to the original.

    order newvarname, after(varname)
    browse
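One thing to watch: encode assigns its codes in alphabetical order, so “High”, “Low”, “Medium” would become 1, 2, 3. A minimal sketch of one way to control the order, assuming a made-up variable level and a hypothetical label name levellbl: define the value label first and pass it to encode through its label() option.

```stata
clear
input str6 level
"Low"
"High"
"Medium"
end

// Define the intended ordering before encoding
label define levellbl 1 "Low" 2 "Medium" 3 "High"

// encode reuses the existing mappings in levellbl instead of alphabetical codes
encode level, generate(level_num) label(levellbl)
list, nolabel
```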

If you want to replace the original string variable with its encoded version, you need to encode first, drop the original variable, and then rename the new variable to the original name. This is tedious to do for many variables, so you can use a loop.

    // put the list of variables that you want to encode after varlist
    foreach v of varlist var1 var2 var3 {
        replace `v' = "" if `v' == "."    // treat "." as missing before encoding
        encode `v', generate(new`v')      // create the labeled numeric copy
        drop `v'                          // drop the original string variable
        rename new`v' `v'                 // give the encoded variable the original name
    }

The decode Command

The decode command is the reverse of encode: it retrieves the original text. If you have encoded a variable and need it back in string format, decode converts it.

    decode varname, generate(newvarname)

More generally, decode turns any labeled numeric (categorical) variable into a string variable that holds its value labels.
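A short round-trip sketch, assuming the hbp2 example data and its string variable sex (the names sex_num and sex_str are hypothetical): encoding and then decoding should reproduce the original text.

```stata
webuse hbp2, clear
encode sex, generate(sex_num)      // string -> labeled numeric
decode sex_num, generate(sex_str)  // labeled numeric -> string again
assert sex_str == sex              // stops with an error if any value differs
```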

Other commands

You can find more useful string functions for data cleaning in Stata, such as lower, upper, subinstr, substr, and strpos, here.

WORKING WITH STRINGS
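A brief sketch of how some of these functions combine, using a made-up two-row dataset (all variable names are hypothetical):

```stata
clear
input str12 city
"New York"
"new-york"
end

// Replace every hyphen with a space, then lowercase, so both rows match
generate city_clean = lower(subinstr(city, "-", " ", .))

// First three characters of the cleaned text
generate first3 = substr(city_clean, 1, 3)

// Position of the first space (0 if there is none)
generate space_at = strpos(city_clean, " ")
list
```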

How to identify which command to use

If string values are all numbers (e.g., id), use destring.
If string values are not numbers (e.g., sex), use encode.
To convert numeric variables to string variables, use tostring.
To revert categorical variables back to the string values, use decode.

Some other tips

Always save your data before making changes: save "filename.dta"
Check consistency in categorical data after encoding.
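One possible way to run that consistency check, assuming the hbp2 example data (sex_num is a hypothetical name): cross-tabulate the original text against the encoded variable and list the value label that encode created, which by default is named after the new variable.

```stata
webuse hbp2, clear
encode sex, generate(sex_num)
tabulate sex sex_num, missing   // each text value should line up with exactly one code
label list sex_num              // show the code-to-text mapping that encode built
```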
