Interview challenge with databases

As part of the interview process, I've been asked to solve this problem:

A railway company has to build a database to manage its booking system. The database has to store:

  • Users: id, name, email address, and city of residence.

  • Cities: id, name.

  • Rides: id, distance, fare, city_start, city_end

Create a table schema and write a query that selects the top 4 users that have traveled the most.

In today's post, I will solve this exercise using PostgreSQL, one of the most used relational databases.

Set up PostgreSQL

Instead of installing PostgreSQL on my laptop, I've decided to run it inside a container. Be sure to have Docker installed before proceeding.

To launch the PostgreSQL container type:

docker run -itd \
    -e POSTGRES_USER=user \
    -e POSTGRES_PASSWORD=yourpassword \
    -p 5432:5432 --name postgresql postgres

The -p option will bind the container port to the host port, in this way we can connect to the database using localhost. Feel free to change the user and password to connect.

Warning: The Docker container has been launched without an external volume, this means that all the data and the table structure are ephemeral and will be lost as soon as the container is stopped.

Install the PostgreSQL client on Ubuntu:

sudo apt install -y postgresql-client

Connect to the database

Use the following command to connect to the database, you will be asked to prompt the password:

psql -h localhost -U user

If you followed all the steps you should now be connected to the PostgreSQL and see the following prompt on the terminal:

psql (14.7 (Ubuntu 14.7-0ubuntu0.22.04.1), server 15.2 (Debian 15.2-1.pgdg110+1))
Type "help" for help.

user=#

Create the database schema

The first step of the exercise consists of creating a table schema for the railway company. Let's create the cities table first, copy the following code and paste it inside the psqlcommand line interface:

CREATE TABLE cities (
    id serial PRIMARY KEY,
    name VARCHAR ( 50 ) NOT NULL
);

The serial types create an autoincrement integer key, VARCHAR defines a string type of 50 characters to store the city name. To double-check if the table has been created type:

\dt

It will print all the tables in the database.

Let's move now to the users table schema:

CREATE TABLE users (
      id serial PRIMARY KEY,
       name VARCHAR ( 50 ) NOT NULL,
       email VARCHAR ( 255 ) UNIQUE NOT NULL,
    city_id INT NOT NULL,
    FOREIGN KEY (city_id)
        REFERENCES cities (id)
);

This schema includes also the so-called 'Foreign key constraints' defined with:

FOREIGN KEY (city_id) REFERENCES cities (id)

This instruction says that the field city_id is a reference to a city stored in the cities table. In other words, the city_id will store the id of a row in the cities table. The convenience of using the foreign key constraints is that the existence of the referenced city will be checked during the insertion of a new user: It is not possible to create a user that points to a non-existing entry in the city table. The constraint works also the other way around, It is not possible to delete a city in case it's referenced by a user. This mechanism helps maintain the integrity of the database and the information stored.

The rides table schema:

CREATE TABLE rides (
    id serial PRIMARY KEY,
    distance real NOT NULL,
    fare real NOT NULL,
    user_id INT NOT NULL,
    city_start INT NOT NULL,
    city_end INT NOT NULL,
    FOREIGN KEY (user_id)
        REFERENCES users (id),
    FOREIGN KEY (city_start)
        REFERENCES cities (id),
    FOREIGN KEY (city_end)
        REFERENCES cities (id)
);

As for the users table, there are three constraints: one on the user that purchased the ride and two that refer to the trip start and end.

The first part of the exercise is complete let's solve the latter now.

Populate the tables

Before building the query we need to fill the table with some data:

INSERT INTO cities (name) VALUES ('London'), ('Paris'), ('Berlin'), ('Amsterdam'), ('Brussel'), ('Zurich'), ('Vienna'), ('Rome'), ('Madrid');

INSERT INTO users (name, email, city_id) VALUES ('John', 'john@mail.com', 1), ('Adam', 'adam@mail.com', 1), ('Roland', 'roland
@mail.com', 4), ('Dan', 'dan@mail.com', 2), ('Franziska', 'franziska@mail.com', 2), ('veronika', 'veronika@mail.com', 2), ('donna', 'donna@mail.com', 2), ('kate', 'kate@mail.com', 2);

INSERT INTO rides (distance, fare, user_id, city_start, city_end) VALUES (200, 50, 1, 2, 3), (300, 80, 1, 2, 4), (400, 12, 2, 1, 5), (200, 12, 3, 2, 3), (200, 80, 4, 3, 2), (130, 70, 5, 1, 6), (450, 60, 5, 7, 9), (220, 55, 5, 2, 8), (230, 12, 5, 5, 4), (890, 56, 6, 3, 2), (432, 99, 6, 1, 3), (189, 12, 6, 4, 7), (349, 56, 6, 6, 8), (710, 89, 6, 5, 1), (634, 90, 6, 6, 3), (325, 23, 7, 7, 4), (541, 26, 8, 1, 2), (900, 19, 8, 3, 8), (821, 56, 8, 4, 7), (323, 55, 8, 5, 4), (120, 59, 8, 2, 6);

To check for the content of a table:

SELECT * FROM cities;

Write the query

The query that we have to write has to report the users that traveled the most with the train, this involves the use of the data stored on two tables.

Let's start small by calculating how much each user has traveled, this information can be retrieved by using the rides table:

SELECT user_id, SUM(distance) AS total_distance FROM rides GROUP BY user_id;
user_id | total_distance 
---------+----------------
       3 |            200
       5 |           1030
       4 |            200
       6 |           3204
       2 |            400
       7 |            325
       1 |            500
       8 |           2705

The query selects the user_id and calculates the quantity total_distance as a SUM() of the column distance. To get the count right for each user we must add a GROUP BY statement that isolates the summation on the rows that contain the same user_id.

Lets now add ORDER BY and LIMIT statement to limit the results to return only the users that traveled the most:

SELECT user_id, SUM(distance) AS total_distance FROM rides GROUP BY user_id ORDER BY total_distance DESC LIMIT 4;
 user_id | total_distance 
---------+----------------
       6 |           3204
       8 |           2705
       5 |           1030
       1 |            500

At this point, we might call the exercise done but we can make it better, what about showing the name of the user instead of the id?

That information is stored in the user table, we must then join the two tables and rewrite the query as follows:

SELECT users.name, SUM(rides.distance) AS tot_distance 
FROM rides 
INNER JOIN users ON users.id = rides.user_id 
GROUP BY users.id 
ORDER BY tot_distance DESC 
LIMIT 4;
   name    | tot_distance 
-----------+--------------
 veronika  |         3204
 kate      |         2705
 Franziska |         1030
 John      |          500

Conclusions

In this post, we have created a table schema for a railway company and wrote a query to retrieve the customers that travel the most. The challenge didn't specify the type of database to use and I opted for PostgreSQL.

What do you think? Did I pass the interview?

Did you find this article valuable?

Support Marco Boretto by becoming a sponsor. Any amount is appreciated!