I spent some time going over the Postgres schema of Gitlab. GitLab is an alternative to Github. You can self host GitLab since it is an open source DevOps platform.
My motivation to understand the schema of a big project like Gitlab was to compare it against schemas I am designing and learn some best practices from their schema definition. I can surely say I learnt a lot.
I am aware that best practices are sometimes context dependent so you should not apply them blindly.
The Gitlab schema file structure.sql [1] is more than 34000 lines of code. Gitlab is a monolithic Ruby on Rails application. The popular way to manage schema migration is using the schema.rb file. The reason the Gitlab team decided to adopt structure.sql instead is mentioned in on of their issues [2] in their issue tracker.
Now what keeps us from using those features is the use of schema.rb. This can only contain standard migrations (using the Rails DSL), which aim to keep the schema file database system neutral and abstract away from specific SQL. This in turn means we are not able to use extended PostgreSQL features that are reflected in schema. Some examples include triggers, postgres partitioning, materialized views and many other great features.
In order to leverage those features, we should consider using a plain SQL schema file (structure.sql) instead of a ruby/rails standard schema schema.rb.
The change would entail switching config.active_record.schema_format=:sql and regenerate the schema in SQL. Possibly, some build steps would have to be adjusted, too.
Now, let’s go over the things I learnt from Gitlab Postgres schema.
Below are some of the tweets from people on this article. If you find this article useful please share and tag me @shekhargulati
If you want to learn a little bit how others are designing their database schemas you will like the analysis of the Gitlab schema. https://t.co/oxPC2HCj4g
— Tobias_Petry.sql (@tobias_petry) July 15, 2022
1. Using the right primary key type for a table
In my work I have made the mistake of standardizing on primary key types. This means standardizing on either bigint or uuid so all tables will have the same type irrespective of their structure, access patterns, and growth rate.
When your database is small this does not have any visible impact but as you grow primary keys have a visible impact on storage space, write speed, and read speed. So, we should give a proper thought process on choosing the right primary key type for a table.
As I discussed in an earlier post[3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference. I also compared against ULID but it also performed poorly. One reason could be the ULID implementation.
Given this context I was interested to learn how Gitlab chooses primary key types.
Out of the 573 tables, 380 tables have bigserial primary key type, 170 have serial4 primary key type, and remaining 23 had composite primary keys.They had no table that used uuid v4 primary key or any other esoteric key type like ULID.
NameDescriptionRangeTextserial4 bytes1 to 2147483647~2.1 billionbigserial8 bytes1 to 9223372036854775807~9.2 quintillion
1 quintillion is equal to 1000000000 billions
The decision to choose serial or bigserial is dependent on the number of records in that table.
Tables like application_settings, badges, chat_teams, notification_settings, project_settings use serial type. For some tables like issues, web_hooks, merge_requests, projects I was surprised to see that they had used the serial type.
The serial type might work for self-hosted community or enterprise versions but for Gitlab.com SaaS service this can cause issues. For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive. The table has to be rewritten, and you will have to wait. This will also be a problem if you have to shard the table.
I performed a quick experiment that showed that for my table with two columns and 10million records it takes 11 seconds to change the data type from integer to bigint.
create table exp_bs(id serial primary key, n bigint not null);
Insert 10million records
insert into exp_bs(n) select g.n from generate_series(1,10000000) as g(n);
Change column type from integer to bigint.
alter table exp_bs alter column id TYPE bigint;
ALTER TABLE
Time: 10845.062 ms (00:10.845)
You will also have to alter the sequence to change its type as well. This operation is quick.
alter sequence exp_bs_id_seq as bigint;
This finished in 4ms
ALTER SEQUENCE
Time: 4.505 ms
All the bigserial sequences start from 1 and go till the max value of bigint.
CREATE SEQUENCE audit_events_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1;
2. Use of internal and external ids
It is generally a good practice to not expose your primary keys to the external world. This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.
So, I was curious to know what happens when you create a Gitlab issue. Do we expose the primary key id to the external user or do we use some other id? If you expose the issues table primary key id then when you create an issue in your project it will not start with 1 and you can easily guess how many issues exist in the GitLab. This is both unsafe and poor user experience.
To avoid exposing your primary keys to the end user the common solution is use two ids. The first is your primary key id which remains internal to the system and never exposed to any public context. The second id is what we share with the external world. In my past experience I have used UUID v4 as the external id. As we discussed in the previous point there is a storage cost involved with using UUID.
GitLab also uses internal and external ids in tables where ids have to be shared with the external world. Tables like issues, ci_pipelines, deployments, epics, and a few others have two ids – id and iid. Below is the part of the issue schema. As shown below iid has integer data type.
CREATE TABLE issues (
id integer NOT NULL,
title character varying,
project_id integer,
iid integer,
// rest of the columns removed
)
As you can see there are id and iid columns. The value of the iid column is shared with the end user. An issue is uniquely identified using project_id and iid. This is because there could be multiple issues with the same iid . To make it more clear, if you create two projects and create one issue in each of the repositories then they both need to have a visible id of 1 as shown in the example below. Both the sg and sg2 projects start with issue id 1. This is achieved using iid.
https://gitlab.com/shekhargulati123/sg/-/issues/1
https://gitlab.com/shekhargulati123/sg2/-/issues/1
They have a unique index on project_id and iid to quickly and efficiently fetch an issue.
CREATE UNIQUE INDEX index_issues_on_project_id_and_iid ON public.issues USING btree (project_id, iid);
3. Using text character type with check constraints
Postgres has three character types as described in their documentation[5].
NameDescriptioncharacter varying(n), varchar(n)variable-length with limitcharacter(n), char(n)fixed-length, blank paddedtextvariable unlimited length
I have mostly used character varying(n) or varchar(n) to store String values. Gitlab schema uses both character varying(n) and text but more often they use text type. One such example table is shown below.
CREATE TABLE audit_events (
id bigint NOT NULL,
author_id integer NOT NULL,
entity_id integer NOT NULL,
entity_type character varying NOT NULL,
details text,
ip_address inet,
author_name text,
entity_path text,
target_details text,
created_at timestamp without time zone NOT NULL,
target_type text,
target_id bigint,
CONSTRAINT check_492aaa021d CHECK ((char_length(entity_path)
>>> Read full article>>>
Copyright for syndicated content belongs to the linked Source : Hacker News – https://shekhargulati.com/2022/07/08/my-notes-on-gitlabs-postgres-schema-design/