SQL For Data Analysis — Pakistan’s Largest E-Commerce

14 min readMar 15, 2023

In today’s data-driven world, businesses are constantly seeking efficient ways to collect, manage, and analyze large amounts of data. This is where SQL comes into play. Structured Query Language (SQL) is a powerful tool used to manage and manipulate relational databases. It is the standard language used to interact with database management systems and is widely used by businesses to extract meaningful insights from their data.

SQL is a declarative language, meaning that users only need to specify what they want to achieve rather than how to achieve it. This makes it an efficient tool for querying and filtering large datasets, as users can simply input their desired parameters and SQL will do the rest. SQL is also highly versatile, offering a wide range of commands for tasks such as creating tables, inserting data, updating records, and deleting data.

One of the key benefits of SQL is its ability to perform complex joins and aggregations. This allows users to combine data from multiple tables or sources, aggregate it into meaningful insights, and derive valuable business intelligence. SQL is also highly scalable, making it suitable for large-scale data projects. Its ability to handle large datasets efficiently has made it a popular tool in industries such as finance, healthcare, and e-commerce.

Overall, SQL is an essential tool for anyone working with relational databases or seeking to derive insights from their data. Its simplicity, versatility, and scalability make it an indispensable asset in the world of data analytics.

SQL is used in data analysis, then how to analyze the data?

The process of data analysis is divided into several parts, namely:

1. Determine the problem

2. the problems that will be analyzed

3. Looking for appropriate data

4. Processing defective data

5. Perform data analysis

6. Interpret the results

From this process, SQL is used to perform data analysis.

In this paper I will explain SQL with an E-Commerce study case by answering business questions. The dataset used comes from Kaggle: Pakistan’s Largest E-Commerce Dataset with 3 datasets, namely order_detail, sku_detail and payment_detail. The first thing that needs to be done is to import the dataset into PostgreSQL. The three datasets have been set up by Bootcamp so you only need to run the given query to import the dataset.

Before answering the questions given, it is important to know what each dataset contains. details of the contents of the order_detail dataset as follows:

1. id → unique number of order / id_order

2. customer_id → unique number of the customer

3. order_date → the date when the transaction was made

4. sku_id → unique number of the product (sku is the stock keeping unit)

5. price → the price listed on the price tagging

6. qty_ordered → the number of items purchased by the customer

7. before_discount → total price value of the product (price * qty_ordered)

8. discount_amount → total product discount value

9. after_discount → the value of the total price of the product when it has been deducted by the discount

10. is_gross → indicates the customer has not paid for the order

11. is_valid → indicates the customer has made a payment

12. is_net → indicates the transaction has been completed

13. payment_id → a unique number of the payment method

detail the contents of the sku_detail dataset as follows:

1. id → a unique number from the product (can be used as a key when joining)

2. sku_name → the name of the product

3. base_price → the price of the item listed on the price tag

4. cogs → cost of goods sold / total cost to sell 1 product

5. category → product category

the detailed contents of the payment_detail dataset are as follows:

1. id → a unique number of the payment method

2. payment_method → payment method used

After knowing all the details of the dataset, analysis can be carried out.

Number 1

Q : During transactions that occurred during 2021, in what month was the total transaction value (after_discount) the largest? Use is_valid = 1 to filter transaction data.

To answer this question, you need to know which column to use. The data needed are:

1. Month, to classify transaction data

2. Total sales, to be grouped by month

These data are not available in all imported datasets, but by modifying the available columns, these data can be obtained. The month data is obtained by using ‘Month’ in the order_date column and the data type is converted to text, while the total sales can be obtained by adding up the entire after_discount column. The data is available in the order_detail dataset.

select
    to_char (order_date,'month') month_2021,
    sum (after_discount) total_sales
from order_detail
where
    is_valid =1
    and to_char (order_date,'yyyy-mm-dd')
    between '2021-01-01' and '2021-12-31'
group by 1
order by 2 desc

The data is conditioned by only displaying data at is_valid = 1 and the year 2021 by using order_date between January 1, 2021 and December 31, 2021. The SUM aggregate function needs to be combined with GROUP BY so that it adds up the after_discount for each month_2021. To display data in descending order, ORDER BY uses total_revenue in descending order.

The results obtained from the run query are as follows:

From the table it can be seen that the highest total transactions during 2021 were in November.

Number 2

Q : During the transactions that occurred during 2021, in which month did the total number of customers (unique), total orders (unique) and total product quantity be the highest? Use is_valid = 1 to filter transaction data.

To answer this question, you need to know which column to use. The data needed are:

Month
Total customers
Order totals
Total order quantity

Month data can be obtained by using ‘Month’ in the order_date column and the data type is converted to text. Total customers and total orders can be calculated by calculating the amount of data available, but it is necessary to use DISTINCT so that the calculated data is unique (no data duplication). The total quantity of orders can be made by adding up all available data.

select
    to_char (order_date,'month') month_2021,
    count (distinct od.customer_id) as jumlah_pelanggan,
    count (distinct od.id) as total_pesanan,
    sum (od.qty_ordered) as total_qty
from order_detail od
where
    is_valid =1
    and to_char (order_date,'yyyy-mm-dd')
    between '2021-01-01' and '2021-12-31'
group by 1
order by 2 desc

The data is conditioned by only displaying data at is_valid = 1 and the year 2021 using order_date between January 1, 2021, and December 31, 2021. The COUNT and SUM aggregate functions need to be combined with GROUP BY so that total_customer, total_order, and total_qty can be grouped for each month_2021. I wrote the number 1 in GROUP BY because month_2021 is in column 1, so it can be simplified by only writing the number 1. In ORDER BY I wrote the number 2 because I want to sort the data based on the total_customer column which is in the 2nd column and the sorting is done in the descendant way.

The results obtained from the run query are as follows:

From the table it can be seen that the highest total customers, total orders and total quantity during 2021 were in November.

Number 3

Q : Compare the transaction value of each category in 2021 with 2022. State which categories have increased and which categories have decreased transaction value from 2021 to 2022. Use is_valid = 1 to filter transaction data. During the transactions that occurred during 2022, which category generated the greatest transaction value? Use is_valid = 1 to filter transaction data.

To answer this question, the following data is needed:

1. Product category

2. Total transaction value

Product category data can be obtained from the sku_detail dataset, while total transaction value data can be obtained from the order_detail dataset by adding up the after_discount column. Because there are two datasets to get the data needed, JOIN is used to combine them.

select * from sku_detail sd
select * from order_detail od
select
    sd.category,
    sum (od.after_discount) total_discount
from order_detail od
left join sku_detail sd on sd.id = od.sku_id
where
    is_valid =1
    and to_char (order_date,'yyyy-mm-dd')
    between '2021-01-01' and '2021-12-31'
group by 1
order by 2 desc

The focus of question 3 is to display the categories and total transaction value in each category so that the selected columns are sku.category and total_revenue (obtained from the sum of after_discount using the SUM aggregate function). The data is conditioned by only showing data in 2022 and is_valid = 1. The aggregate function needs to be combined with GROUP BY so that it sums after_discount for each sku.category. To display data in descending order, ORDER BY uses total_revenue in descending order. The results obtained from the run query are as follows:

From the table it can be seen that the highest total revenue for 2022 is in the Mobiles & Tablets category.

Number 4

To answer this question, the following data is needed:

1. Product category

2. Total transaction value for 2021

3. Total transaction value for 2022

4. The difference between the total transaction value for 2021 and 2022

Product category data can be obtained from the sku_detail dataset, while other data requires more in-depth processing which can be obtained from the order_detail dataset.

In this query I use CTEs to make it easier for me to extract the year_order data and combine the order_detail dataset with the sku_detail dataset.

select * from sku_detail sd
select * from order_detail od

--CTE
with a as (
select
    sd.category,
    sum (case when to_char(order_date,'yyyy-mm-dd')
    between '2021-01-01' and '2021-12-31'
    then od.after_discount end) total_sales_2021,
    sum (case when to_char(order_date,'yyyy-mm-dd')
    between '2022-01-01' and '2022-12-31'
    then od.after_discount end) total_sales_2022
from order_detail od
left join sku_detail sd on sd.id = od.sku_id
where is_valid = 1                                                                                                                                                                                                                                                                                                    
group by 1
order by 2 desc )
select a.*, total_sales_2022 - total_sales_2021 growth_value
from a
order by 4 desc

and if the query is executed, the result will be as follows:

From the table it can be seen that the categories that have decreased in transaction value from 2021 to 2022 are:

1. Others

2. Soghaat

3. Books

4. Men’s Fashion

While those that experienced an increase in transaction value were:

1. Mobile & Tablets

2.Women Fashion

3.Entertainment

4. Appliances

5.Superstores

6. Computing

7.Kids & Babies

8. Beauty & Grooming

9. Health & Sports

10. Home & Living

11. Schools & Education

Number 5

Q: Display the Top 10 sku_name (along with their categories) based on the value of transactions that occurred during 2022. Also display the total number of customers (unique), total orders (unique) and total total quantity. Use is_valid = 1 to filter transaction data.

To answer this data we need to know what table we need first

1.order_detail

2.sku_detail

Ater we know,both of table that we need ,we must to use the logical syntax/query to find the right answer,the query command is in the picture

select
    sd.sku_name,
    sd.category,
    sum (od.after_discount) total_sales,
    count (distinct od.customer_id) as total_pelanggan,
    count (distinct od.id) as total_order,
    sum (od.qty_ordered) as total_qty
from order_detail od
left join sku_detail sd on sd.id = od.sku_id
where
    is_valid =1
    and to_char (order_date,'yyyy-mm-dd')
    between '2022-01-01' and '2022-12-31'
group by 1,2
order by 3 desc
limit 10

We can see from the pictures ,we all those syntax we can find the right answer like this :

After the result, we know what is the Top 10 on the list

Number 6

Q : Show the top 5 most popular payment methods used in 2022 (based on total unique orders). Use is_valid = 1 to filter transaction data.

Source table: order_detail, payment_method

To answer this question, you need to know which column to use. The table we needed are:

1.order_detail

2.Payment_method

And then we can search 5 most popular payment methods used in 2022 with the syntax/query like this :

select
    pd.payment_method,
    count (distinct od.id) as total_order
from order_detail od
left join payment_detail pd on pd.id = od.payment_id
where
    is_valid =1
    and to_char (order_date,'yyyy-mm-dd')
    between '2022-01-01' and '2022-12-31'
group by 1
order by 2 desc
limit 5

To start with pd.payment method and limit it to 5 like the question and then we can find the right answer like this.

There you go,now we know what is the top 5 on the table.

Number 7

Q: Sort these 5 products based on their transaction value.

1. Samsung

2. Apples

3. Sony

4.Huawei

5.Lenovo

Use is_valid = 1 to filter transaction data.

Source table: order_detail, sku_detail

follow this source code:

WITH a AS (
SELECT
  id,
  (CASE
     WHEN LOWER(sku_name) LIKE '%samsung%' THEN 'Samsung'
     WHEN LOWER(sku_name) LIKE '%apple%' OR LOWER(sku_name) LIKE '%iphone%' THEN 'Apple'
     WHEN LOWER(sku_name) LIKE '%sony%' THEN 'Sony'
     WHEN LOWER(sku_name) LIKE '%huawei%' THEN 'Huawei'
     WHEN LOWER(sku_name) LIKE '%lenovo%' THEN 'Lenovo'
  END) AS brand
FROM
  sku_detail
)
SELECT
  sku.brand,
ROUND(SUM(ord.after_discount)) AS total_revenue
FROM
  order_detail AS ord
LEFT JOIN
  a AS sku
  ON ord.sku_id = sku.id
WHERE
  is_valid = 1 AND
  brand IS NOT NULL
GROUP BY
  sku.brand
ORDER BY
  total_revenue DESC;

This code is a SQL query that aggregates data to calculate total revenue for each product brand in the “sku_detail” table and sorts them from the highest revenue brand to the lowest.

1. First, the code uses a subquery named “a” that projects the “id” and “brand” columns from the “sku_detail” table. The “brand” column will have brand values based on conditions outlined in several “CASE” statements using the “LOWER” function to make the string in lowercase, so it is not affected by capitalization differences.

2. Next, the code joins the “order_detail” table with the “a” subquery using the product ID (sku_id) as the join key. The “order_detail” table stores information about each order, including the product ID, and the revenue amount after discount (the “after_discount” column).

3. The code then filters the joined result to ensure only valid data (is_valid = 1) and non-null brands are taken.

4. The code then aggregates the data by calculating the total revenue (using the “SUM” function) for each product brand and rounds it to the nearest integer (using the “ROUND” function).

5. The final result is then sorted based on the calculated brand total revenue, starting from the highest to the lowest (using the “ORDER BY” function in descending order).

Therefore, this code can help to obtain information about the product brand that generates the most revenue from the “order_detail” and “sku_detail” tables.

The result from the code:

Number 8

Q : Like question no. 3, make a comparison of the profit values for 2021 and 2022 for each category. Then make the difference % profit difference between 2021 and 2022 (profit = after_discount — (cogs*qty_ordered)) use is_valid = 1 to filter transaction data.

Source table: order_detail, sku_detail

follow the code for this question:

with b as (
with a as (

select
    od.id,
    to_char(od.order_date,'yyyy')year_order,
    sd.category,
    od.after_discount - (sd.cogs*od.qty_ordered) profit
from order_detail od
left join sku_detail sd on sd.id = od.sku_id
where
    is_valid =1)
select
    a.category,
    sum (case when year_order = '2021' then a.profit end) profit_2021,
    sum (case when year_order = '2022' then a.profit end) profit_2022
from a
group by 1)
select
    b.*,
    (b.profit_2022-b.profit_2021)/b.profit_2021 growth
from b
order by 4 desc

This code is a SQL query that calculates the profit and profit growth rate for each product category between 2021 and 2022 based on the data stored in the “order_detail” and “sku_detail” tables.

1. The code uses a subquery “a” that selects the “id”, “year_order”, “category”, and “profit” columns by joining the “order_detail” and “sku_detail” tables. The “profit” column is calculated by subtracting the cost of goods sold (cogs) multiplied by the quantity ordered from the “after_discount” column. The year of the order date is extracted using the “to_char” function.

2. The subquery “a” is then used in another subquery “b” to group the data by product category and calculate the total profit for each category in 2021 and 2022 using the “sum” function with “case” statements. The result is grouped by the product category.

3. Finally, the main query calculates the growth rate by subtracting the 2021 profit from the 2022 profit and dividing the result by the 2021 profit for each category. The result is ordered in descending order based on the growth rate.

In summary, this code can help to identify the product categories that have the highest profit growth rate between 2021 and 2022 based on the “order_detail” and “sku_detail” tables.

The result from the code:

Number 9

Q : Show the top 5 SKU with the highest profit contribution in 2022 based on the categories with the most profit growth from 2021 to 2022 (based on result no H). Use is_valid = 1 to filter transaction data.

we need product name and total profit. the product name is obtained from the sku_detail table and the total product is obtained from combining the sku_detail and order_detail tables

we will display the top 5 profit orders use this query:

WITH tab_profit AS (
SELECT
 ord.id,
 sku.sku_name,
 ord.after_discount - (sku.cogs * ord.qty_ordered) AS profit
FROM
 order_detail AS ord
LEFT JOIN
 sku_detail AS sku
 ON sku.id = ord.sku_id
WHERE
 is_valid = 1 AND
 order_date BETWEEN '2022-01-01' AND '2022-12-31' AND
 sku.category = 'Women Fashion')
SELECT
 sku_name,
 SUM(profit) AS total_profit
FROM
 tab_profit
GROUP BY
 sku_name
ORDER BY
 total_profit DESC
LIMIT 5;

Output from the query is top 5 product according total profit 2022

Number 10

Q : Show the number of unique orders using the top 5 payment methods (question no F) by product category in 2022. Use is_valid = 1 to filter transaction data.

First we need product category and payment method used by customer. Category product can get from table sku_detail and payment method get from join table order_detail and payment_detail.

We can use this query:

ELECT
 sku.category,
 COUNT(DISTINCT CASE WHEN pay.payment_method = 'cod' THEN ord.id END) AS cod,
 COUNT(DISTINCT CASE WHEN pay.payment_method = 'Easypay' THEN ord.id END) AS easypay,
 COUNT(DISTINCT CASE WHEN pay.payment_method = 'Payaxis' THEN ord.id END) AS payaxis,
 COUNT(DISTINCT CASE WHEN pay.payment_method = 'customercredit' THEN ord.id END) AS customercredit,
 COUNT(DISTINCT CASE WHEN pay.payment_method = 'jazzwallet' THEN ord.id END) AS jazzwallet
FROM
 order_detail AS ord
LEFT JOIN
 payment_detail pay
 ON pay.id = ord.payment_id
LEFT JOIN
 sku_detail sku
 ON sku.id = ord.sku_id
WHERE
 is_valid = 1 AND
 order_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY
 sku.category
ORDER BY
 cod DESC
Limit 5;

Output from the query show top 5 payment method filter by COD. Men Fashion is the highest product whose payment uses COD.

In this article, we have discussed 10 common questions about SQL for data analysis, ranging from how to import data into a database to querying. From the discussions above, we can conclude that SQL can help us efficiently and effectively process large amounts of data, allowing us to make better business decisions. Furthermore, SQL can be used to answer specific questions that arise during data analysis. Keep learning and practicing to master SQL and improve your data analysis skills. We hope this article has been helpful for you!