My past few positions have had me using a lot of Python, R, Docker, Airflow etc., and not as much pure SQL for analysis. As a result, my SQL skills are a little rusty. So I decided to create a development environment for SQL in order to practice. In this quick post I’ll outline how I setup a PostgreSQL environment in a docker container. There are plenty of practice environments online, but I want to have certain data tables available to practice analyzing business data specifically (MAU, ARPU, Revenue, etc.).
Once finished We’ll be able to connect, to the db, and query like below:
library(DBI)
library(RPostgres)
con <- dbConnect(RPostgres::Postgres(), host="localhost",
port=5432, user="anton", password="gilfoyle", dbname="piper")
select * from orders limit 5
| order_date | user_id | order_id | meal_id | order_quantity |
|---|---|---|---|---|
| 2018-06-01 | 0 | 0 | 14 | 2 |
| 2018-06-01 | 0 | 0 | 15 | 1 |
| 2018-06-01 | 0 | 0 | 17 | 1 |
| 2018-06-01 | 1 | 1 | 16 | 2 |
| 2018-06-01 | 1 | 1 | 17 | 1 |
Let’s get started with a simple docker container using Docker Compose. Create a docker-compose.yml file with the below.
version: '3.8'
services:
db:
image: postgres:16-alpine # Or a specific version like postgres:14-alpine
restart: always
environment:
POSTGRES_DB: piper
POSTGRES_USER: anton
POSTGRES_PASSWORD: gilfoyle
ports:
- "5432:5432" # Maps host port 5432 to container port 5432
volumes:
- db_data:/var/lib/postgresql/data # Persists data in a named volume
- ./csv:/csv
volumes:
db_data: # Define the named volume
This installs postgres16, creates the environment and mounts a few volumes.
We can start the container by:
docker-compose up --build
To build and stand-up the image from a terminal window.
make sure to have postgres and libpq installed using your preferred method (here homebrew):
brew install postgresql
brew install libpq
Then open an new terminal in our working directory and issue:
psql -h localhost -U anton -d piper
This will open psql and allow us to execute sql commands. You should see piper=# as a prompt.
As a Hello World we can create a simple table. Paste in the code below:
CREATE TABLE mtcars (
car_name VARCHAR(50) PRIMARY KEY,
mpg NUMERIC,
cyl INT,
disp NUMERIC,
hp INT,
drat NUMERIC,
wt NUMERIC,
qsec NUMERIC,
vs INT,
am INT,
gear INT,
carb INT
);
INSERT INTO mtcars
(car_name, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
VALUES
('Mazda RX4', 21, 6, 160, 110, 3.9, 2.62, 16.46, 0, 1, 4, 4),
('Mazda RX4 Wag', 21, 6, 160, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4),
('Datsun 710', 22.8, 4, 108, 93, 3.85, 2.32, 18.61, 1, 1, 4, 1),
('Hornet 4 Drive', 21.4, 6, 258, 110, 3.08, 3.215, 19.44, 1, 0, 3, 1),
('Hornet Sportabout', 18.7, 8, 360, 175, 3.15, 3.44, 17.02, 0, 0, 3, 2),
('Valiant', 18.1, 6, 225, 105, 2.76, 3.46, 20.22, 1, 0, 3, 1),
('Duster 360', 14.3, 8, 360, 245, 3.21, 3.57, 15.84, 0, 0, 3, 4),
('Merc 240D', 24.4, 4, 146.7, 62, 3.69, 3.19, 20, 1, 0, 4, 2),
('Merc 230', 22.8, 4, 140.8, 95, 3.92, 3.15, 22.9, 1, 0, 4, 2),
('Merc 280', 19.2, 6, 167.6, 123, 3.92, 3.44, 18.3, 1, 0, 4, 4),
('Merc 280C', 17.8, 6, 167.6, 123, 3.92, 3.44, 18.9, 1, 0, 4, 4),
('Merc 450SE', 16.4, 8, 275.8, 180, 3.07, 4.07, 17.4, 0, 0, 3, 3),
('Merc 450SL', 17.3, 8, 275.8, 180, 3.07, 3.73, 17.6, 0, 0, 3, 3),
('Merc 450SLC', 15.2, 8, 275.8, 180, 3.07, 3.78, 18, 0, 0, 3, 3),
('Cadillac Fleetwood', 10.4, 8, 472, 205, 2.93, 5.25, 17.98, 0, 0, 3, 4),
('Lincoln Continental', 10.4, 8, 460, 215, 3, 5.424, 17.82, 0, 0, 3, 4),
('Chrysler Imperial', 14.7, 8, 440, 230, 3.23, 5.345, 17.42, 0, 0, 3, 4),
('Fiat 128', 32.4, 4, 78.7, 66, 4.08, 2.2, 19.47, 1, 1, 4, 1),
('Honda Civic', 30.4, 4, 75.7, 52, 4.93, 1.615, 18.52, 1, 1, 4, 2),
('Toyota Corolla', 33.9, 4, 71.1, 65, 4.22, 1.835, 19.9, 1, 1, 4, 1),
('Toyota Corona', 21.5, 4, 120.1, 97, 3.7, 2.465, 20.01, 1, 0, 3, 1),
('Dodge Challenger', 15.5, 8, 318, 150, 2.76, 3.52, 16.87, 0, 0, 3, 2),
('AMC Javelin', 15.2, 8, 304, 150, 3.15, 3.435, 17.3, 0, 0, 3, 2),
('Camaro Z28', 13.3, 8, 350, 245, 3.73, 3.84, 15.41, 0, 0, 3, 4),
('Pontiac Firebird', 19.2, 8, 400, 175, 3.08, 3.845, 17.05, 0, 0, 3, 2),
('Fiat X1-9', 27.3, 4, 79, 66, 4.08, 1.935, 18.9, 1, 1, 4, 1),
('Porsche 914-2', 26, 4, 120.3, 91, 4.43, 2.14, 16.7, 0, 1, 5, 2),
('Lotus Europa', 30.4, 4, 95.1, 113, 3.77, 1.513, 16.9, 1, 1, 5, 2),
('Ford Pantera L', 15.8, 8, 351, 264, 4.22, 3.17, 14.5, 0, 1, 5, 4),
('Ferrari Dino', 19.7, 6, 145, 175, 3.62, 2.77, 15.5, 0, 1, 5, 6),
('Maserati Bora', 15, 8, 301, 335, 3.54, 3.57, 14.6, 0, 1, 5, 8),
('Volvo 142E', 21.4, 4, 121, 109, 4.11, 2.78, 18.6, 1, 1, 4, 2)
;
MTCARS is a famous sample data set. Now we can query it.
select * from mtcars limit 5
| car_name | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
I won’t go into great detail on how to load all the tables but here is a quick example.
A good way to populate some sample data is to curl some files exec int the container, copy the files. From the terminal:
curl "https://assets.datacamp.com/production/repositories/4016/datasets/606e6e9165c25477db078996fa7e0a3e994b93d3/orders.csv" > orders.csv
docker exec -it <container-id> bash
cp orders.csv var/lib/postgresql/.
Now create an orders table in the container using psql like:
DROP TABLE IF EXISTS "orders";
CREATE TABLE orders (
order_date DATE,
user_id INT,
order_id INT,
meal_id INT,
order_quantity INT
);
Now use the psql COPY command to copy the csv into the orders table
COPY orders FROM '../orders.csv' DELIMITER ',' CSV HEADER;
Now you have a nice little development environment to practice SQL based data analysis. Enjoy!
Let’s say you want to take a look at the change in frequent vs infrequent customers over time. We’ll user data from a fictional food delivery service. We’ll define “frequent” users as those who place 4 or more orders a month and infrequent as those who placed less. Let’s look at the sum of orders for both groups to see how many orders each place in total each month (i.e. How many total orders each group contributes to the total).
-- frequent vs infrequent
with order_count_by_user as (
select
user_id,
DATE_TRUNC('month', order_date)::DATE as delivr_month,
COUNT(distinct order_id) as order_count
from orders
group by delivr_month, user_id
),
frequent_users as (
select
user_id,
delivr_month,
order_count
from order_count_by_user
where order_count >= 4
order by user_id
),
sum_frequent as (
select
frequent_users.delivr_month,
SUM(frequent_users.order_count) as order_sum_frequent
from frequent_users
group by frequent_users.delivr_month
order by frequent_users.delivr_month
),
infrequent_users as (
select
user_id,
delivr_month,
order_count
from order_count_by_user
where order_count < 4
order by user_id
),
sum_infrequent as (
select
infrequent_users.delivr_month,
SUM(infrequent_users.order_count) as order_sum_infrequent
from infrequent_users
group by infrequent_users.delivr_month
order by infrequent_users.delivr_month
)
select
sum_frequent.delivr_month,
order_sum_frequent,
order_sum_infrequent
from sum_frequent
inner join sum_infrequent
on sum_frequent.delivr_month = sum_infrequent.delivr_month;
| delivr_month | order_sum_frequent | order_sum_infrequent |
|---|---|---|
| 2018-06-01 | 74 | 208 |
| 2018-07-01 | 95 | 350 |
| 2018-08-01 | 122 | 548 |
| 2018-09-01 | 214 | 791 |
| 2018-10-01 | 505 | 1044 |
| 2018-11-01 | 1198 | 1386 |
| 2018-12-01 | 3472 | 1344 |
Now let’s calculate the percent of users that are frequent vs infrequent, by count of users, to see how this is trending.
-- percent frequent users by count
with order_count_by_user as (
select
user_id,
DATE_TRUNC('month', order_date)::DATE as delivr_month,
COUNT(distinct order_id) as order_count
from orders
group by delivr_month, user_id
),
frequent_users as (
select
user_id,
delivr_month,
order_count
from order_count_by_user
where order_count >= 4
order by user_id
),
count_frequent as (
select
frequent_users.delivr_month,
COUNT(distinct frequent_users.user_id) as user_count_frequent
from frequent_users
group by frequent_users.delivr_month
order by frequent_users.delivr_month
),
infrequent_users as (
select
user_id,
delivr_month,
order_count
from order_count_by_user
where order_count < 4
order by user_id
),
count_infrequent as (
select
infrequent_users.delivr_month,
COUNT(distinct infrequent_users.user_id) as user_count_infrequent
from infrequent_users
group by infrequent_users.delivr_month
order by infrequent_users.delivr_month
),
all_users as (
select
user_id,
delivr_month,
order_count
from order_count_by_user
order by user_id
),
count_all_users as (
select
all_users.delivr_month,
COUNT(distinct all_users.user_id) as user_count_total
from all_users
group by all_users.delivr_month
order by all_users.delivr_month
)
select
count_frequent.delivr_month,
user_count_frequent,
user_count_infrequent,
user_count_total,
ROUND(user_count_frequent / user_count_total::NUMERIC, 3)
as perc_users_frequent
from count_frequent
inner join count_infrequent
on count_frequent.delivr_month = count_infrequent.delivr_month
inner join
count_all_users
on count_frequent.delivr_month = count_all_users.delivr_month;
| delivr_month | user_count_frequent | user_count_infrequent | user_count_total | perc_users_frequent |
|---|---|---|---|---|
| 2018-06-01 | 17 | 106 | 123 | 0.138 |
| 2018-07-01 | 22 | 204 | 226 | 0.097 |
| 2018-08-01 | 28 | 309 | 337 | 0.083 |
| 2018-09-01 | 48 | 441 | 489 | 0.098 |
| 2018-10-01 | 113 | 576 | 689 | 0.164 |
| 2018-11-01 | 257 | 687 | 944 | 0.272 |
| 2018-12-01 | 668 | 599 | 1267 | 0.527 |
The ratio of frequent to infrequent customers is increasing over time. As well, total orders and customer counts are also growing. These are good signs of a healthy service.
… to be continued