Interview challenge with databases
As part of the interview process, I've been asked to solve this problem:
A railway company has to build a database to manage its booking system. The database has to store:
Users: id, name, email address, and city of residence.
Cities: id, name.
Rides: id, distance, fare, city_start, city_end
Create a table schema and write a query that selects the top 4 users that have traveled the most.
In today's post, I will solve this exercise using PostgreSQL, one of the most used relational databases.
Set up PostgreSQL
Instead of installing PostgreSQL on my laptop, I've decided to run it inside a container. Be sure to have Docker installed before proceeding.
To launch the PostgreSQL container type:
docker run -itd \
-e POSTGRES_USER=user \
-e POSTGRES_PASSWORD=yourpassword \
-p 5432:5432 --name postgresql postgres
The -p
option will bind the container port to the host port, in this way we can connect to the database using localhost
. Feel free to change the user and password to connect.
Warning: The Docker container has been launched without an external volume, this means that all the data and the table structure are ephemeral and will be lost as soon as the container is stopped.
Install the PostgreSQL client on Ubuntu:
sudo apt install -y postgresql-client
Connect to the database
Use the following command to connect to the database, you will be asked to prompt the password:
psql -h localhost -U user
If you followed all the steps you should now be connected to the PostgreSQL and see the following prompt on the terminal:
psql (14.7 (Ubuntu 14.7-0ubuntu0.22.04.1), server 15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.
user=#
Create the database schema
The first step of the exercise consists of creating a table schema for the railway company. Let's create the cities
table first, copy the following code and paste it inside the psql
command line interface:
CREATE TABLE cities (
id serial PRIMARY KEY,
name VARCHAR ( 50 ) NOT NULL
);
The serial
types create an autoincrement integer key, VARCHAR
defines a string type of 50 characters to store the city name. To double-check if the table has been created type:
\dt
It will print all the tables in the database.
Let's move now to the users
table schema:
CREATE TABLE users (
id serial PRIMARY KEY,
name VARCHAR ( 50 ) NOT NULL,
email VARCHAR ( 255 ) UNIQUE NOT NULL,
city_id INT NOT NULL,
FOREIGN KEY (city_id)
REFERENCES cities (id)
);
This schema includes also the so-called 'Foreign key constraints' defined with:
FOREIGN KEY (city_id) REFERENCES cities (id)
This instruction says that the field city_id
is a reference to a city stored in the cities table. In other words, the city_id
will store the id of a row in the cities
table. The convenience of using the foreign key constraints is that the existence of the referenced city will be checked during the insertion of a new user: It is not possible to create a user that points to a non-existing entry in the city table. The constraint works also the other way around, It is not possible to delete a city in case it's referenced by a user. This mechanism helps maintain the integrity of the database and the information stored.
The rides
table schema:
CREATE TABLE rides (
id serial PRIMARY KEY,
distance real NOT NULL,
fare real NOT NULL,
user_id INT NOT NULL,
city_start INT NOT NULL,
city_end INT NOT NULL,
FOREIGN KEY (user_id)
REFERENCES users (id),
FOREIGN KEY (city_start)
REFERENCES cities (id),
FOREIGN KEY (city_end)
REFERENCES cities (id)
);
As for the users
table, there are three constraints: one on the user that purchased the ride and two that refer to the trip start and end.
The first part of the exercise is complete let's solve the latter now.
Populate the tables
Before building the query we need to fill the table with some data:
INSERT INTO cities (name) VALUES ('London'), ('Paris'), ('Berlin'), ('Amsterdam'), ('Brussel'), ('Zurich'), ('Vienna'), ('Rome'), ('Madrid');
INSERT INTO users (name, email, city_id) VALUES ('John', 'john@mail.com', 1), ('Adam', 'adam@mail.com', 1), ('Roland', 'roland
@mail.com', 4), ('Dan', 'dan@mail.com', 2), ('Franziska', 'franziska@mail.com', 2), ('veronika', 'veronika@mail.com', 2), ('donna', 'donna@mail.com', 2), ('kate', 'kate@mail.com', 2);
INSERT INTO rides (distance, fare, user_id, city_start, city_end) VALUES (200, 50, 1, 2, 3), (300, 80, 1, 2, 4), (400, 12, 2, 1, 5), (200, 12, 3, 2, 3), (200, 80, 4, 3, 2), (130, 70, 5, 1, 6), (450, 60, 5, 7, 9), (220, 55, 5, 2, 8), (230, 12, 5, 5, 4), (890, 56, 6, 3, 2), (432, 99, 6, 1, 3), (189, 12, 6, 4, 7), (349, 56, 6, 6, 8), (710, 89, 6, 5, 1), (634, 90, 6, 6, 3), (325, 23, 7, 7, 4), (541, 26, 8, 1, 2), (900, 19, 8, 3, 8), (821, 56, 8, 4, 7), (323, 55, 8, 5, 4), (120, 59, 8, 2, 6);
To check for the content of a table:
SELECT * FROM cities;
Write the query
The query that we have to write has to report the users that traveled the most with the train, this involves the use of the data stored on two tables.
Let's start small by calculating how much each user has traveled, this information can be retrieved by using the rides
table:
SELECT user_id, SUM(distance) AS total_distance FROM rides GROUP BY user_id;
user_id | total_distance
---------+----------------
3 | 200
5 | 1030
4 | 200
6 | 3204
2 | 400
7 | 325
1 | 500
8 | 2705
The query selects the user_id
and calculates the quantity total_distance
as a SUM()
of the column distance. To get the count right for each user we must add a GROUP BY
statement that isolates the summation on the rows that contain the same user_id.
Lets now add ORDER BY
and LIMIT
statement to limit the results to return only the users that traveled the most:
SELECT user_id, SUM(distance) AS total_distance FROM rides GROUP BY user_id ORDER BY total_distance DESC LIMIT 4;
user_id | total_distance
---------+----------------
6 | 3204
8 | 2705
5 | 1030
1 | 500
At this point, we might call the exercise done but we can make it better, what about showing the name of the user instead of the id
?
That information is stored in the user
table, we must then join the two tables and rewrite the query as follows:
SELECT users.name, SUM(rides.distance) AS tot_distance
FROM rides
INNER JOIN users ON users.id = rides.user_id
GROUP BY users.id
ORDER BY tot_distance DESC
LIMIT 4;
name | tot_distance
-----------+--------------
veronika | 3204
kate | 2705
Franziska | 1030
John | 500
Conclusions
In this post, we have created a table schema for a railway company and wrote a query to retrieve the customers that travel the most. The challenge didn't specify the type of database to use and I opted for PostgreSQL.
What do you think? Did I pass the interview?