使用文本功能清理数据

The data that you obtain from different sources many not be in a form ready for analysis. In this chapter, you will understand how to prepare your data that is in the form of text for analysis.

Initially, you need to clean the data. Data cleaning includes removing unwanted characters from text. Next, you need to structure the data in the form you require for further analysis. You can do the same by −

  • Finding required text patterns with the text functions.
  • Extracting data values from text.
  • Formatting data with text functions.
  • Executing data operations with the text functions.

Removing Unwanted Characters from Text

When you import data from another application, it can have nonprintable characters and/or excess spaces. The excess spaces can be −

  • leading spaces, and/or
  • extra spaces between words.

If you sort or analyze such data, you will get erroneous results.

Consider the following example −

Product Data

This is the raw data that you have obtained on product information containing the Product ID, Product description and the price. The character “|” separates the field in each row.

When you import this data into Excel worksheet, it looks as follows −

Import Data

As you observe, the entire data is in a single column. You need to structure this data to perform data analysis. However, initially you need to clean the data.

You need to remove any nonprintable characters and excess spaces that might be present in the data. You can use the CLEAN function and TRIM function for this purpose.

S.No. Function & Description
1.

CLEAN

Removes all nonprintable characters from text

2.

TRIM

Removes spaces from text

  • Select the Cells C3 – C11.
  • Type =TRIM (CLEAN (B3)) and then press CTRL + Enter.

The formula is filled in the cells C3 – C11.

Formula Filled

The result will be as shown below −

Formula Filled Result

Finding required Text Patterns with the Text Functions

To structure your data, you might have to do certain Text Pattern matching based on which you can extract the Data Values. Some of the Text Functions that are useful for this purpose are −

S.No. Function & Description
1.

EXACT

Checks to see if two text values are identical

2.

FIND

Finds one text value within another (case-sensitive)

3.

SEARCH

Finds one text value within another (not case-sensitive)

Extracting Data Values from Text

You need to extract the required data from text in order to structure the same. In the above example, say, you need to place the data in three columns – ProductID, Product_Description and Price.

You can extract data in one of the following ways −

  • Extracting Data Values with Convert Text to Columns Wizard
  • Extracting Data Values with Text Functions
  • Extracting Data Values with Flash Fill

Extracting Data Values with Convert Text to Columns Wizard

You can use the Convert Text to Columns Wizard to extract Data Values into Excel columns if your fields are −

  • Delimited by a character, or
  • Aligned in columns with spaces between each field.

In the above example, the fields are delimited by the character “|”. Hence, you can use the Convert Text to Columns wizard.

  • Select the data.

  • Copy and paste values in the same place. Otherwise, Convert Text to Columns takes the functions rather than the data itself as the input.

Convert Text to Columns
  • Select the data.

  • Click on Text to Columns in the Data Tools group under Data Tab on the Ribbon.

Step 1 − Convert Text to Columns Wizard - Step 1 of 3 appears.

  • Select Delimited.
  • Click Next.
Convert Text to Columns Step1

Step 2 − Convert Text to Columns Wizard - Step 2 of 3 appears.

  • Under Delimiters, select Other.

  • In the box next to Other, type the character |

  • Click Next.

Convert Text to Columns Step2

Step 3 − Convert Text to Columns Wizard - Step 3 of 3 appears.

In this screen, you can select each column of your data in the wizard and set the format for that column.

  • For Destination, select the cell D3.

  • You can click Advanced, and set Decimal Separator and Thousands Separator in the Advanced Text Import Settings dialog box that appears.

  • Click Finish.

Convert Text to Columns Step3

Your data, which is converted to columns appears in the three Columns – D, E and F.

  • Name the Column headers as ProductID, Product_Description and Price.
Name Column Headers

Extracting Data Values with Text Functions

Suppose the fields in your data neither are delimited by a character nor are aligned in columns with spaces between each field, you can use text functions to extract data values. Even in the case the fields are delimited, you can still use text functions to extract data.

Some of the text functions that are useful for this purpose are −

S.No. Function & Description
1.

LEFT

Returns the leftmost characters from a text value

2.

RIGHT

Returns the rightmost characters from a text value

3.

MID

Returns a specific number of characters from a text string starting at the position you specify

4.

LEN

Returns the number of characters in a text string

You can also combine two or more of these text functions as per the data you have at hand, to extract the required data values. For example, using a combination of LEFT, RIGHT and VALUE functions or using a combination of FIND, LEFT, LEN and MID functions.

In the above example,

  • All the characters left to the first | give the name ProductID.

  • All the characters right to the second | give the name Price.

  • All the characters that lie between the first | and second | give the name Product_Description.

  • Each | has a space before and after.

Observing this information, you can extract the data values with the following steps −

  • Find the Position of First | - First | Position

    • You can use FIND function

  • Find the Position of Second | - Second | Position

    • You can use FIND function again

  • Beginning to (First | Position – 2) Characters of the Text give ProductID

    • You can use LEFT Function

  • (First | Position + 2) to (Second | Position - 2) Characters of the Text give Product_Description

    • You can use MID Function

  • (Second | Position + 2) to End Characters of the Text give Price

    • You can use RIGHT Function

Extract Data Values

The result will be as shown below −

Extract Data Values Result

You can observe that the values in the price column are text values. To perform calculations on these values, you have to format the corresponding cells. You can look at the section given below to understand formatting text.

Extracting Data Values with Flash Fill

Using Excel Flash Fill is another way to extract data values from text. However, this works only when Excel is able to find a pattern in the data.

Step 1 − Create three columns for ProductID, Product_Description and Price next to the data.

Create Columns

Step 2 − Copy and paste the values for C3, D3 and E3 from B3.

Paste Values

Step 3 − Select cell C3 and click Flash Fill in the Data Tools group on the Data tab. All the values for ProductID get filled.

Flash Fill

Step 4 − Repeat the above given steps for Product_Description and Price. The data is filled.

Repeat Steps

Formatting Data with Text Functions

Excel has several built-in text functions that you can use for formatting data containing text. These include −

Functions that format the Text as per your need

S.No. Function & Description
1.

LOWER

Converts text to lowercase

S.No. Function & Description
1.

UPPER

Converts text to uppercase

2.

PROPER

Capitalizes the first letter in each word of a text value

Functions that convert and/or format the Numbers as Text

S.No. Function & Description
1.

DOLLAR

Converts a number to text, using the $ (dollar) currency format

2.

FIXED

Formats a number as text with a fixed number of decimals

3.

TEXT

Formats a number and converts it to text

Functions that convert the Text to Numbers

S.No. Function & Description
1.

VALUE

Converts a text argument to a number

Executing Data Operations with the Text Functions

You might have to perform certain Text Operations on your Data. For example, if Login-IDs for the Employees are changed to a New Format in an Organization, based on the Format Change, Text Replacements might have to be done.

Following Text Functions help you in performing Text Operations on your data containing Text −

S.No. Function & Description
1.

REPLACE

Replaces characters within text

2.

SUBSTITUTE

Substitutes new text for old text in a text string

3.

CONCATENATE

Joins several text items into one text item

4.

CONCAT

Combines the text from multiple ranges and/or strings, but it does not provide the delimiter or IgnoreEmpty arguments.

5.

TEXTJOIN

Combines the text from multiple ranges and/or strings, and includes a delimiter you specify between each text value that will be combined. If the delimiter is an empty text string, this function will effectively concatenate the ranges.

6.

REPT

Repeats text a given number of times