Spreadsheets are the ultimate tool for data cleaning. They make it easy to assess what data needs to be cleaned and have built-in tools to efficiently clean big data sets in seconds.
Row Zero is an enterprise-grade spreadsheet that works like Excel and Google Sheets but is 1000x more powerful, so you can clean massive data sets. You can also connect directly to your data source to import data, automate spreadsheet updates, and export clean data to your database or data warehouse.
In this guide, we’ll give tips on big data cleaning in spreadsheets using Row Zero, but many of these techniques can be used to clean data in Excel and Google Sheets on smaller datasets.
10 tips for big data cleaning in spreadsheets
Spreadsheets offer a wide range of functions and features for cleaning big datasets. You can easily make one-off changes, bulk update, or automate data cleaning. Here are data cleaning tips across 10 categories:
- Leverage the core spreadsheet features
- Handling duplicates
- Text cleaning and formatting
- Splitting and parsing data
- Combining and concatenating
- Joining data and mapping values
- Handling errors, missing data, and inconsistent data
- Date / time cleanup and grouping
- Cleaning data with Python
- Two-way data sync
1. Leverage the core spreadsheet features
The core spreadsheet features are data cleaning super powers, especially for non-technical users. With a one-click data import, you can view entire datasets and easily filter, sort, format, add, edit, and delete. Row Zero offers 200+ functions, dynamic pivot tables, charts, etc.
Row Zero's big unlock is it lets you work with 1000x bigger datasets in spreadsheets compared to Excel and Google Sheets, so you can use spreadsheet features and functions to clean big datasets quickly.
2. Handling duplicates
One of the most common data cleaning techniques is removing duplicates and/or finding, highlighting, and counting unique values. In addition to the remove duplicates feature, there are several tools for managing unique and duplicate values automatically as your data updates.
- Remove duplicates - This feature automatically deletes duplicate rows from the sheet and retains the first row that includes the duplicated value. You select the column(s) that you want to remove duplicates from and then all matching duplicate rows are deleted.Important tip: Be sure to sort on the attribute you want to keep before removing duplicates (for example, the first or last date/time).
While the remove duplicates feature is very helpful, it is a one-time action. You can automatically remove duplicates as new data updates with connected tables and any of the following:
- UNIQUE - The UNIQUE function creates a list of unique values from a specified range and can be used to dynamically remove duplicates as data updates.
- HISTOGRAM - The HISTOGRAM function creates a table of unique values from a specified range and counts how many times each value appears. It's a fast way to keep track of duplicate counts.
- COUNTIF - You can use the COUNTIF and COUNTIFS functions to create a counter column to count the number of times a value appears in a range. This unlocks several ways to filter, sort, highlight, and clean data without just bulk deleting everything with remove duplicates.
- COUNTUNIQUE counts the number of unique values in a range and COUNTUNIQUEIFS counts unique values that meet multiple criteria.
- Dynamic pivot tables - Row Zero pivot tables dynamically update with new data and can be built on connected tables that sync with your data source to pull in the latest data. By default, pivot tables only include unique rows for each set of fields assigned to Rows, so they are an easy way to remove duplicates and/or show and summarize data by unique values. You can also use the built-in pivot table count unique and count calculations to count unique values and duplicate values.
- Conditional format duplicates - You can use Row Zero's conditional formatting feature to find and highlight duplicate values but not remove them. Just set the Condition to Duplicate to highlight duplicates with your desired formatting. You can also highlight duplicates across multiple columns.
3. Text cleaning and formatting
There are a number of useful functions for cleaning and formatting text data:
- TEXT converts values to text and makes it easy to custom format numbers and dates using custom format codes.
- VALUE and NUMBERVALUE functions convert text to numbers and can fix import issues.
- TRIM removes all spaces from text except for single spaces between words
- CLEAN removes all nonprintable characters from text and is useful for cleaning data imported from other applications that may contain hidden control characters.
- UPPER, LOWER, and PROPER functions fix inconsistent casing.
- SUBSTITUTE, REPLACE, and REGEXREPLACE functions are used to replace text (e.g. replace dashes with spaces) and insert new text at specific positions.
4. Splitting and parsing data
- Split text to columns - This built-in tool makes it easy to split text to multiple columns using a custom separator. This can be useful for parsing unparsed CSV data or splitting any text field on a common separator (e.g. splitting email addresses on @ symbol)
- TEXTBEFORE and TEXTAFTER functions return the text before or after a given delimitter and are very useful for splitting text data like full names, addresses, email addresses, etc.
- LEFT, RIGHT, and MID functions extract parts of a string
- FIND and SEARCH functions are used to locate characters/strings inside text
5. Combining and concatenating
After parsing and cleaning data, it's common to recombine cleaned data fields.
- CONCAT - Concatenates a list of values into a single string
- TEXTJOIN - Concatenates a list of values into a single string, with a specified delimiter and can optionally ignore empty cells
6. Joining data and mapping values
You can use spreadsheet functions to join multiple datasets and/or map values to a standard list:
- XLOOKUP looks up a specified value in a range and returns a corresponding value from another range. It's a common spreadsheet function for joining and mapping data across datasets.
- MULTILOOKUP looks up values based on multiple criteria.
- CHOOSE returns the value at a specified position in a list of values
- IFS can be used to standardize values, e.g. =IFS(A2<4, "Low", A2<7, "Med", A2>=7, "High")
Row Zero also supports VLOOKUP, but XLOOKUP is better in all situations.
7. Handling errors, missing data, and inconsistent data
Handling errors, blanks, missing data, and inconsistent data is a key part of data cleaning. Here are some core functions for cleaning bad data.
- IFERROR lets you customize what to show when errors are returned from a formula. You can show blanks (using "") or something explicit like "Missing" or "Error".
- ISBLANK and COUNTBLANK check for blanks and count blanks.
- Wildcards - Wildcards can be used for pattern matching, which is extremely useful for data cleaning, especially when dealing with inconsistent or partial values. Wildcards let you match inconsistent or unknown parts of a string.
- The asterisk wildcard (*) is evaluated as any sequence of characters and is helpful when working with values with inconsistent names (e.g. Fruit Co, Fruit Co. and Fruit Company in the example below).
- The question mark wildcard (?) is evaluated as any single character. For example, "???" could count every value that is 3 characters.
- The asterisk wildcard (*) is evaluated as any sequence of characters and is helpful when working with values with inconsistent names (e.g. Fruit Co, Fruit Co. and Fruit Company in the example below).
Many functions also have built-in error handling. For example, XLOOKUP has an optional [if_not_found] argument that lets you specify the value to return if the key is not found.
8. Date and time cleanup and grouping
There are several functions to help clean up dates and times and summarize data by dates
- TEXT - As mentioned above, the TEXT function let's you custom format dates and times using custom format codes
- DATEPART - Extracts the date part of a datetime
- TIMEPART - Extracts the time part of a datetime
- WEEKDAY - Returns the weekday (1-7, starting at Sunday) of a date
- WEEKNUM - Returns the week number of a date
- MONTH - Returns the month (1-12) of a date
- YEAR - Returns the year of a date
You can also use the built-in pivot table group by date feature to group by month, week, etc.
9. Cleaning data with Python
The code window, accessed in the top right of any Row Zero workbook is a native Python development environment that works like a code cell in a Jupyter notebook. You can write custom Python functions that are referenced in the spreadsheet and import popular Python libraries like scipy and pandas to clean data.
10. Two-way data sync
Row Zero has built-in connectors to your database or data warehouse that make it easy to import data and automatically clean, analyze and transform your data. You can also write-back clean data to your database or data warehouse.
You can use Row Zero as a spreadsheet GUI for database cleansing as well as data preprocessing - you can open big files, clean them up, and then import directly to your database or data warehouse.
How to clean big files in spreadsheets
Row Zero makes it easy to clean big files in a spreadsheet. You can open large files with a simple one-click import and use the spreadsheet features highlighted above to clean your file and then analyze, share, or export the data to your data warehouse. Row Zero's big data power makes it easy to open and edit big CSV files, parquet, JSONL, TXT, XLSX, etc. Row Zero also automatically unzips and opens .gz files.
Database cleansing with a spreadsheet GUI
Row Zero's big data power makes it possible to import entire database tables into the spreadsheet and leverage the above features in the spreadsheet to clean database tables. Using Row Zero as a spreadsheet GUI, you can easily view full database tables, remove duplicates, add calculated columns, pivot data tables, and add, edit, and delete rows and fields. At any point, you can write-back the clean data to the database as a new table.
Data preprocessing with spreadsheets
Spreadsheets are a common tool for data preprocessing since they make it easy to view and clean raw data for accurate data analysis, modeling, and machine learning. Row Zero's big data power makes it possible to preprocess big data sets in the spreadsheet and Row Zero supports a variety of large file formats including CSV, parquet, JSONL, etc. If you connect to your data source, you can easily preview and clean big files before importing to your database or data warehouse.
Challenges cleaning big data
For many folks, spreadsheets are the best tool for cleaning data, since you can easily assess what needs to be cleaned and quickly clean data without writing code. However, if you're working with large datasets, it can be a challenge to clean data in Excel and Google Sheets. Excel limits data to 1,048,576 rows and Google Sheets limits data to 10 million cells. Many teams use BI tools like Tableau or Power BI to analyze data but BI tools aren't good at processing or cleaning raw data, so a data preprocessing step is needed to prep data for Tableau, Power BI, etc. For more technical folks, it can be a challenge to clean data with code without first exploring and viewing the data to assess what needs to be cleaned.
Row Zero was designed to solve these challenges and is a great tool for cleaning big data. You can edit, analyze, and clean massive datasets in a familiar spreadsheet. Row Zero combines the big data power and connectivity of a BI tool with the built-in features and flexibility of a spreadsheet. If you need to clean a file with more than one million rows or clean database tables in a spreadsheet, Row Zero is your best choice.