Over the years, we've done quite a few migrations as the company and our skills scaled. We've gone from a single node to a multi-tier cluster using Kubernetes, self-maintained worker nodes, two layers of load balancing, several (not-so-)microservices and the appropriate Ansible playbooks.
I've always been somewhat into sysadmin-related things. First it was just a focus on Linux in general, and understanding kernels and their inner workings (as far as I could comprehend these things, being a novice programmer at the time).
Take only the appropriate steps, but think ahead
Obviously, our single server couldn't handle the traffic we receive now. At the time, it was just enough to serve around 100,000 users a month with only minor performance issues. There's already a very valuable takeaway here:
It's not always 'more servers'
Accept minor performance issues when starting out your scaling process. Scaling properly is not simply acquiring better performance; scaling is steadily supporting a changing process. It's more organizational than purely User Experience (UX) focused.

'Scalability' is used and abused in fancy ways, and sometimes restricted to purely technical contexts, but I've found that a more holistic approach provides a better thinking tool. And that's really the only thing the concept of scaling is useful for: a tool for reasoning and communicating about how to deal with a changing process. It's about enabling teams to deliver properly, not only to your users but also to, for example, your colleagues. I call this ops scaling, in contrast to infrastructure scaling. Ops scalability can be defined as the ability of your team to handle change in demand, and this largely depends on the quality of communication.
First step of infrastructure scaling: load balancing
Taking the above into account, I chose HAProxy for our HA load balancing setup in a specific project we worked on. It's very performant, quite easy to set up and requires little maintenance. It was perfect to start out with, and has definitely held its own over the last couple of years. We very much needed SNI* support to make our job easier, as a lot of domain names get served through one load balancer. Instead of doing manual configuration for every subdomain, I simply set up an HAProxy ACL with SNI configuration, enabling me to quickly upload certificates with little or no configuration change.
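To make the idea concrete, here is a minimal sketch of what such a frontend can look like. The hostnames, backend names and certificate path are illustrative, not our actual configuration:

```
# Illustrative HAProxy frontend using SNI-based routing.
frontend https_in
    # Pointing 'crt' at a directory makes HAProxy load every certificate in it
    # and pick the matching one based on the SNI hostname the client sends.
    bind *:443 ssl crt /etc/haproxy/certs/

    # Route to a backend based on the SNI value of the TLS handshake.
    acl is_app ssl_fc_sni -i app.example.com
    acl is_api ssl_fc_sni -i api.example.com

    use_backend app_servers if is_app
    use_backend api_servers if is_api
    default_backend app_servers
```

Because certificates are selected from the directory automatically, adding a new subdomain can be as simple as dropping a new certificate file in place and reloading HAProxy.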
Dare to take some risks
Trust in communities and open source projects.
As we mostly use Python and Django for our backend chores, we have good reason to use PostgreSQL as our database. We ran a production database without a hot standby server for some time. Of course, the moment I had time, this was fixed. Generally, I like to approach acceptable risks as the conscious creation of a problem that you know you will (and can) solve later. It allows for constant improvement, and also provides the necessary pressure to keep yourself busy and motivated.
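For reference, a hot standby via streaming replication boils down to surprisingly little configuration. This is a generic sketch for PostgreSQL 12 and later, with placeholder hostnames and credentials rather than our actual setup:

```
# postgresql.conf on the primary (illustrative values):
wal_level = replica
max_wal_senders = 5

# postgresql.conf on the standby:
hot_standby = on
primary_conninfo = 'host=primary.example.internal user=replicator'

# Bootstrapping the standby from a base backup:
#   pg_basebackup -h primary.example.internal -U replicator \
#       -D /var/lib/postgresql/data -R
# The -R flag writes standby.signal and primary_conninfo for you,
# so the server starts up as a read-only standby.
```

With that in place, the standby continuously replays the primary's WAL and can be promoted if the primary fails.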
Summing up, it comes down to:
Have the guts to be like this guy, but also try really hard not to be:
*SNI stands for Server Name Indication and makes it possible to serve multiple certificates on a single IP address. The client sends the hostname at the start of the TLS handshake, which lets the server select the correct certificate to present.