Unstructured Data Teks with SQL — Mini Project and Study Case
Sometimes, we got sample data that has errors. Typos, or incorrect writing according to the rules are the most common errors. It would take a long time to fix them, but we can quickly solve it with “REGEX” notation.
In this case, we use sample data from DQLab. Look the table below.
The table indicates the same province, which is Yogyakarta but due to the use of free text format, customers are free to write Yogya, Jogja, Jogjakarta, D.I. Yogyakarta, and Yogyakarta.
When conducting data analysis processes such as aggregation, this will inevitably result in different analytical values if the writing is not standardized. Therefore, it is crucial to ensure uniformity in the writing. We can use Regular Expression (REGEX) to solve it.
Regular Expression or REGEX is a text string that enables you to create patterns that assist in matching, searching, and managing text data types.
You can read the documentation about REGEX on this link https://dev.mysql.com/doc/refman/8.0/en/regexp.html
The syntax REGEXP in general is:
SELECT * FROM nama_tabel WHERE nama_kolom REGEXP 'argumen'
CASE 1. Display the data in the ‘dqlabregex’ table with the ‘kota’ column ending in ‘ing’. Follow the syntax below:
SELECT * FROM dqlabregex WHERE kota REGEXP 'ng$'
and the resulting output is as follows
The regex notation ‘$’ is used to match the end of a character. So, the notation ‘ng$’ indicates to retrieve text that ends with ‘ng’, which matches the texts Bandung and Semarang.
CASE 2. To display all the data by filtering the name of the recording in the “staff_pencatat” column in the “dqlabregex” table with the name Senja, Sendja, SenDja, Sen_ja and similiar variations.
Use the syntax below to solve case 2.
SELECT * FROM dqlabregex WHERE staf_pencatat REGEXP 'Sen.?ja'
and the resulting output is as follows
CASE 3. In the “dqlabregex” table, the column “jumlah_member” represents a quantity, meaning that only numeric values are allowed as the content of the rows (data records). However, there are input or typing errors in the table, resulting in non-numeric characters in the “jumlah_member” column.
You are asked to check and display input data errors in that column. Create a query that displays all data with input errors in the “jumlah_member” column. Follow the syntax below:
SELECT * FROM dqlabregex WHERE jumlah_member REGEXP '[^0-9]'
And the resulting output is as follows
Like to highlight that REGEXP is case-sensitive, meaning that uppercase and lowercase letters are considered different characters. In the words, it is sensitive to the letter’s case.9
REGEXP_LIKE is a query in SQL that functions similarly to the like query (a query used in the WHERE clause for conditionally searching for specific patterns in a column).
There is a slight difference with regex. While regex is case-sensitive, the REGEXP_LIKE function has an option — even though it is case-sensitive by default — where we can initialize a parameter to change it to ignore case sensitivity or keep sensitivity.
In general, the syntax for REGEXP_LIKE is:
SELECT * FROM table_name WHERE REGEXP_LIKE (column, 'argument', match_parameter)
CASE 4. Search for the name “staff_pencatat” in a table that starts with ‘an’ while ignoring case sensitivity. We can use the following query:
SELECT * FROM dqlabregex WHERE REGEXP_LIKE(staf_pencatat, '^AN','i')
And the resulting output is as follows
The notation ‘AN’ means to search for the prefix ‘AN’ in the string. Adding the ‘i’ character to the match_parameter means that case-sensitivity is ignored, allowing the words “Andra” and “Antara”.
CASE 5. Display all data by filtering the name of the recording officer in the “staff_pencatat” in the “dqlabregex” table with the names SenDja or Sen_ja while ignoring case sensitivity.
SELECT * FROM dqlabregex WHERE REGEXP_LIKE (staf_pencatat,'sen.?ja', 'i')
And the resulting output is as follows
CASE 6. To check and display input errors in the “jumlah_member” column. We can use the syntax below:
SELECT * FROM dqlabregex WHERE REGEXP_LIKE( jumlah_member, '[^0-9]', 'i')
And the resulting output is as follows
There is a non-numeric character (letter ‘o’) observed in the “jumlah_member” column.
REGEXP_REPLACE is a SQL query that functions similarly to the replace query (a query used to replace specific strings/characters in a column with other strings/characters) but with the additional of supporting regex notation.
The general syntax for REGEXP_REPLACE in SQL is as follows:
SELECT REGEXP_REPLACE (column, 'string you want to change', 'new string') FROM table
CASE 7. Replace the text ‘Sendja’ with ‘Senja’ in the “staff_pencatat” column. Use regex notation so that if there any matching texts to ‘Senja’ in the future, such as ‘Sen_ja’, ‘Sen ja’ and so on they will be changed to a single text, ‘Senja’. Then change the column name to “pencatat using the alias query (AS)
Use the syntax below to solve case 7.
SELECT REGEXP_REPLACE(staf_pencatat,'Sen.?ja', 'Senja') AS pencatat FROM dqlabregex
And the resulting output is as follows
CASE 8. Delete non-numeric characters in “jumlah_member” column and then, show all columns in the “dqlabregex” table. Use the syntax below:
SELECT no_pencatatan, tanggal_catat, kota, REGEXP_REPLACE(jumlah_member, '[^0-9]',"") AS jumlah_member, staf_pencatat
FROM dqlabregex
and the resulting output is as follows
CASE 9. To standardize the date format from DD-MM-YYYY to MM/DD/YYYY for consistency in that column and change the column name to “tanggal_pencatatan”. Display the original column and the column after standardization.
Use the syntax below
SELECT tanggal_catat, REGEXP_REPLACE(tanggal_catat, '([0-9]{2})-([0-9]{2})-([0-9]{4})', '$2/$1/$3') AS tanggal_pencatatan
FROM dqlabregex
and the resulting output is as follows
MINI PROJECT 1. Clean All Columns.
Performing record replacement in each column if the dqlabregex table by following several available notes.
- In the “tanggal_catat” column, change all dates formats to a date format supported by SQL, cush as the YYYY-MM-DD format
- Delete non-numeric characters in “jumlah_member” column
- Change the record that contain Sendja or its variations to Senja
For solve the mini project 1, we can use the syntax below
SELECT no_pencatatan,
CASE
WHEN REGEXP_LIKE(tanggal_catat, '([0-9]{2})-([0-9]{2})-([0-9]{4})')
THEN REGEXP_REPLACE(tanggal_catat, '([0-9]{2})-([0-9]{2})-([0-9]{4})', '$3-$2-$1')
ELSE
REGEXP_REPLACE(tanggal_catat, '([0-9]{2})/([0-9]{2})/([0-9]{4})', '$3-$1-$2')
END AS tanggal_catat,
kota,
REGEXP_REPLACE(jumlah_member, '[^0-9]', '') AS jumlah_member,
REGEXP_REPLACE(staf_pencatat, 'Sen.?ja','Senja') AS staf_pencatat
FROM dqlabregex;
and the resulting output is as follows
MINI PROJECT 2. Aggregate.
To determine the total members records by each staff recording, which is the sum of the “jumlah_member” column group by “staff_pencatat” and then sort it from smallest to largest. Please note that data cleaning should be performed first by removing non-numeric characters in the “jumlah_member” column and replacing the text ‘Sendja’ with ‘Senja’ using regex notation.
To solve mini project 2, we can use the following syntax:
SELECT
SUM(REGEXP_REPLACE(jumlah_member, '[^0-9]', '')) AS total_member,
REGEXP_REPLACE(staf_pencatat, 'Sen.?ja', 'Senja') AS staf_pencatat
FROM dqlabregex
GROUP BY 2 ORDER BY 1;
and the resulting output is as follows
Above is the REGEX material to handle unstructured text issues in SQL. Hopefully it can assist you in learning SQL queries. Keep up the learning spirit!