gidi CLI 💻, Scale from 0 to Million 🌎

By Prajwal Haniya

Techletter #57 | January 28, 2024

How to scale your app from zero to a million users?

When you start building software, you keep it simple. At a startup you don’t have a choice: you have to keep it simple at the beginning and keep improving on top of it. You start with a single-server setup.

  1. Users access the server through a domain name. DNS is a paid third-party service, not something we host ourselves. (You can’t do everything in-house; paying a third party is often more reasonable than running your own setup.)
  2. DNS returns the IP (Internet Protocol) address to the client.
  3. HTTP requests are made to the web server through the IP address.
  4. The server returns the respective response.
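
The four steps above can be sketched with Python's standard library. This is only an illustration: `HelloHandler`, `fetch`, and `demo` are made-up names, and the name `localhost` plus a throwaway local server stand in for a real domain and web server so the example runs without internet access.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Stand-in for the single web server; answers every GET with plain text."""

    def do_GET(self):
        body = b"hello from the single server"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(hostname, port, path="/"):
    # Steps 1-2: ask the resolver for the IP address behind the name.
    ip_address = socket.gethostbyname(hostname)
    # Step 3: send the HTTP request to that IP address.
    connection = http.client.HTTPConnection(ip_address, port, timeout=5)
    try:
        connection.request("GET", path, headers={"Host": hostname})
        # Step 4: the server returns its response.
        response = connection.getresponse()
        return ip_address, response.status, response.read().decode()
    finally:
        connection.close()


def demo():
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("localhost", server.server_port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the resolved IP address, the HTTP status code, and the response body.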

As the user base grows, you need more than one server: one to handle web requests and another for the database.

Relational databases such as MySQL and PostgreSQL are usually a good choice, because relational databases have proven themselves over more than 40 years.

Non-relational databases are a good fit if you need very low latency, your data is unstructured, or you need to store massive amounts of data.

Scaling techniques:

  1. Vertical Scaling

    1. Increase the number of CPUs
    2. Improve the hardware.

    Vertical scaling has hard limits because you cannot add unlimited CPUs to a single machine. It also provides no failover or redundancy: if that one server goes down, the app goes down with it.

  2. Horizontal Scaling

    Horizontal scaling means adding more servers, and it suits large-scale applications better. It creates a new requirement: you need a load balancer to manage the traffic.

Load balancer:

A load balancer distributes the incoming traffic across the web servers. Users now connect to the load balancer and no longer have direct access to the servers, which is better for security. The load balancer and the servers talk to each other over private IP addresses, so only machines inside the same network can reach the servers.

Adding a load balancer in between solves the failover problem for the web servers and improves availability: if one server goes offline, traffic is routed to the others.

As traffic increases you just need to add more servers. The load balancer will handle the rest.
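
A minimal sketch of that behaviour, assuming a simple round-robin strategy (the notes do not say which balancing algorithm is used). The class name, method names, and the private IP addresses are hypothetical.

```python
from itertools import count


class RoundRobinLoadBalancer:
    """Hands out servers in turn and skips any that are marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)      # private IPs of the web servers
        self.healthy = set(self.servers)  # servers currently considered up
        self._counter = count()

    def add_server(self, server):
        # Scaling out: a new server simply joins the rotation.
        self.servers.append(server)
        self.healthy.add(server)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call, starting from the next slot.
        for _ in range(len(self.servers)):
            server = self.servers[next(self._counter) % len(self.servers)]
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinLoadBalancer(["10.0.0.1", "10.0.0.2"])
first_four = [balancer.pick() for _ in range(4)]  # alternates between both servers
balancer.mark_down("10.0.0.1")                    # failover: only 10.0.0.2 is picked now
```

A real load balancer would discover failures through health checks instead of manual `mark_down` calls.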

Here we have only one database, so we haven’t yet solved the issue with failover and redundancy for the data tier.

Database replication is required to address this problem.

Database Replication:

Database replication is usually set up as a master/slave relationship between the original database (the master) and its copies (the slaves).

The master generally handles only write operations. Slave databases receive copies of the master's data and handle only read operations. Most applications read far more often than they write, so there are usually more slave databases than master databases.

Some of the advantages of this are:

  1. Better performance
  2. Reliability (If one server is destroyed by some event, you still have many servers with the same copy of data)
  3. High availability

In this architecture, if one slave database goes offline then the read operation is redirected to another slave database. If all slave databases go offline, then the read operations are redirected to a master database. If the master database goes offline then a slave is promoted to master.
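
The routing rules in the paragraph above can be sketched as follows. This is a simplified model with hypothetical names; the slave databases are called replicas in the code, and promotion here just picks the first online replica, whereas a real promotion also has to deal with replicas whose data lags behind the master.

```python
import random


class ReplicatedDatabase:
    """Routes writes to the master and reads to the replicas (slaves)."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)
        self.offline = set()

    def mark_offline(self, node):
        self.offline.add(node)

    def _online_replicas(self):
        return [node for node in self.replicas if node not in self.offline]

    def _promote_replica(self):
        online = self._online_replicas()
        if not online:
            raise RuntimeError("no database node is online")
        # Assumption: the first online replica becomes the new master.
        self.master = online[0]
        self.replicas.remove(self.master)

    def route_write(self):
        if self.master in self.offline:
            self._promote_replica()
        return self.master

    def route_read(self):
        online = self._online_replicas()
        if online:
            return random.choice(online)
        # All replicas are offline: reads fall back to the master.
        return self.route_write()


db = ReplicatedDatabase("db-master", ["db-replica-1", "db-replica-2"])
db.mark_offline("db-replica-1")
read_target = db.route_read()    # only db-replica-2 can serve reads now
db.mark_offline("db-master")
write_target = db.route_write()  # db-replica-2 is promoted to master
```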

Now we need to improve the response time.

The response time can be improved by using

  1. Cache

    A cache is temporary storage for frequently accessed data. On a read, you first check whether the data is in the cache. If it is not, you query the database, write the result to the cache, and send the response to the client. This strategy is called a read-through cache.

    A cache server is not ideal for persisting data. Whenever the cache is full, you need to delete old data according to an eviction policy.

  2. CDN

    A CDN is a network of geographically dispersed servers that deliver static content (images, videos, CSS, JavaScript files, etc.). A CDN works much like a cache. Two important things to consider:

    1. CDN fallback: decide how your app copes when the CDN fails, for example by having clients fetch the files from the origin server.
    2. Invalidating files: remove or replace a file before it expires, either through APIs provided by the CDN vendor or through object versioning.
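
The read-through strategy described under Cache can be sketched like this. The eviction policy here is least recently used (LRU), which is an assumption made for the example since the notes do not name one; `ReadThroughCache` and `load_user` are hypothetical names, and a plain dictionary stands in for the database.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Checks the cache first and loads from the database on a miss."""

    def __init__(self, load_from_db, capacity=2):
        self._load_from_db = load_from_db
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        value = self._load_from_db(key)      # cache miss: go to the database
        self._items[key] = value             # write the result to the cache
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value


fake_database = {"user:1": "Ada", "user:2": "Linus", "user:3": "Grace"}
database_reads = []


def load_user(key):
    database_reads.append(key)  # record every trip to the database
    return fake_database[key]


cache = ReadThroughCache(load_user, capacity=2)
cache.get("user:1")  # miss: loaded from the database
cache.get("user:1")  # hit: served from the cache
```

After these two calls `database_reads` contains a single entry, because the second read never reached the database.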

Stateful vs. Stateless architecture

A stateful server remembers client data from one request to the next. A stateless server keeps no state information between requests.

Adding or removing servers in a stateful architecture is difficult because each server holds the session data of the users it has served. Every request from a user must therefore be routed to the same server; otherwise the server that receives it will not have the data it needs to respond. This scales poorly compared with a stateless architecture, where session data is kept in separate persistent storage, such as MySQL or PostgreSQL, that every web server can reach.
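
A small sketch of the stateless approach: two web servers share one session store, so either of them can answer any request. All names are hypothetical, and the in-memory dictionary stands in for the shared database.

```python
class SessionStore:
    """Stand-in for shared storage; a dict here, a database in practice."""

    def __init__(self):
        self._sessions = {}

    def save(self, session_id, data):
        self._sessions[session_id] = dict(data)

    def load(self, session_id):
        return self._sessions.get(session_id)


class StatelessWebServer:
    """Keeps no session data itself; everything lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, username):
        self.store.save(session_id, {"user": username})

    def handle(self, session_id):
        session = self.store.load(session_id)
        if session is None:
            return f"{self.name}: please log in"
        return f"{self.name}: hello {session['user']}"


store = SessionStore()
server_a = StatelessWebServer("server-a", store)
server_b = StatelessWebServer("server-b", store)

server_a.login("session-1", "alice")   # the login request lands on server A
reply = server_b.handle("session-1")   # the next request lands on server B
```

Here `reply` is `"server-b: hello alice"`: server B can serve the user even though server A handled the login, so servers can be added or removed freely.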

Message Queues

A message queue is a durable component, stored in memory, that supports asynchronous communication.

It has producers/publishers, which create messages, and consumers/subscribers, which process them.

This decoupling makes message queues well suited to building scalable applications, because producers and consumers can be scaled independently.
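
The producer/consumer pattern can be sketched with Python's standard library. Note the assumption: the in-process `queue.Queue` only stands in for a real message broker and is not durable. `run_pipeline` and the job names are made up for the example.

```python
import queue
import threading


def run_pipeline(jobs, worker_count=2):
    """Publishes jobs to a queue and lets worker threads consume them."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def consumer():
        while True:
            job = tasks.get()
            if job is None:  # sentinel: no more work for this worker
                tasks.task_done()
                break
            processed = job.upper()  # stand-in for slow work
            with results_lock:
                results.append(processed)
            tasks.task_done()

    workers = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for worker in workers:
        worker.start()

    # Producer side: publish the jobs and return immediately after each put.
    for job in jobs:
        tasks.put(job)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker

    tasks.join()  # wait until every message has been processed
    for worker in workers:
        worker.join()
    return sorted(results)


processed_jobs = run_pipeline(["resize", "crop", "blur"])
```

The producer never calls a consumer directly, so the number of workers can be changed through `worker_count` without touching the producing code.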

Logging, Metrics, and Automation

Once the business has grown, you will need additional tooling for logging, metrics, and automation to keep the application running smoothly.

Database Scaling

Vertical scaling means adding more CPUs, RAM, disk, etc. to an existing machine. Horizontal scaling, also known as sharding, means adding more servers.

Each shard has the same schema, but the data on each shard is unique to that shard. Here the user ID (UserId) determines which server holds the data: whenever a user's data is accessed, a hash function of the user ID finds the corresponding shard.
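
A minimal sketch of this lookup, assuming four shards and a simple modulo as the hash function (both are assumptions for illustration; the names are hypothetical and dictionaries stand in for the database servers).

```python
SHARD_COUNT = 4  # assumed number of database servers


def shard_for(user_id, shard_count=SHARD_COUNT):
    """Maps a user ID to a shard index with a simple modulo hash."""
    return user_id % shard_count


class ShardedUserStore:
    """Keeps one dictionary per shard; every shard has the same 'schema'."""

    def __init__(self, shard_count=SHARD_COUNT):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, user_id, record):
        shard = self.shards[shard_for(user_id, len(self.shards))]
        shard[user_id] = record

    def get(self, user_id):
        shard = self.shards[shard_for(user_id, len(self.shards))]
        return shard.get(user_id)


users = ShardedUserStore()
users.put(5, {"name": "Ada"})  # 5 % 4 == 1, so the record lands on shard 1
record = users.get(5)          # the same hash routes the read to shard 1
```

With a modulo hash, changing the number of shards moves most keys to a different shard, which is one reason the choice of sharding scheme matters.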

The most important decision is the choice of sharding key. A good key lets you retrieve and modify data efficiently by routing each query to the correct database server. Sharding also introduces additional complexity to the system.

These are my notes from chapter 1 of the book System Design Interview: An Insider’s Guide by Alex Xu.

Why did I build the git-discover CLI?

git-discover is a CLI tool that helps you go through any open-source repository on GitHub.

Open-source repositories are fairly complex if you are new to a project. You cannot understand the code just by skimming through it; it can take weeks or even months to understand a project. I was in the same situation while going through the Express source code, so I built this tool to help me go through any repository on GitHub quickly, right from its first commits.

GitHub lets you filter commits by date, but I found it too slow to go through each commit that way. I wanted something quicker, which is why I started building git-discover. So far I have built a simple prototype.

link to repo: https://github.com/prajwalhaniya/git-discover

Here is the demo: https://youtu.be/IKQdJ8npwwY?si=Db9NFzsxASv0Zh_F