Black Friday doesn't just get shoppers' pulses racing; it's also an exciting time for the software engineering teams at Digitec Galaxus. The goal is clear: the store must withstand the massive onslaught. Preparations begin as early as summer.
Before I joined Digitec Galaxus this fall as a software engineer, I was always interested in what goes on behind the scenes during big discount days like Black Friday or Cyber Monday. This year, I can not only satisfy my curiosity, but also take you on a tour of the stages of preparation. As a developer on the team responsible for our community, I’m right in the thick of things this year.
November in July
It starts on a hot day in July: the organizing committee for this year's discount days meets for the first time and draws up an initial forecast. The questions in the room: How many customers are we expecting? What preparations do we need to make? What special scenarios must teams prepare for? Where are the risks? Findings from these analyses help Product Development (our in-house software development) make your shopping experience as smooth as possible during Black Friday week.
Over the following months, Category Management gives its all in the hunt for the best offers. Of course, the coveted items must be in stock in time for the campaign days. At the same time, our logistics team moves promotional items into the automated storage system as a precaution. This way, ordered products reach the packaging department faster and are soon on their way to you.
More servers, more power
Our business doesn’t run on just any tin box in our office. We host almost all of our shop systems in a so-called Kubernetes cluster on the Microsoft Azure cloud. Simply put, a Kubernetes cluster consists of a large number of virtual servers, also known as nodes, on which our systems run. The standard configuration is enough for 358 days a year. During Black Friday week, however, we expect a massive increase in traffic to our stores, so at peak times we expand the cluster with additional nodes to scale our systems up. This approach is both efficient and ecological: after the campaign week, the servers can be rented by other Azure cloud customers instead of sitting unused.
But it's not just additional servers that are in demand. Our development teams also need to prepare the features in their area of responsibility for heavy workloads. In my case, that's loading and adding comments and ratings. To do this, we use a tool that simulates high load so we can test how our store behaves in this scenario. The results, together with the projected number of users, show us where bottlenecks still exist in the system. Shortly before the event week, we go on a shopping tour of our own and rent the additional nodes, so we're ready for the big rush.
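The idea behind such a load test can be sketched in a few lines. The following TypeScript is a minimal stand-in, not our actual tooling: it fires a batch of concurrent requests at a handler and counts successes and failures. The fake backend and its capacity limit of 50 are illustrative assumptions.

```typescript
// Minimal load-test sketch: fire N concurrent requests at a handler
// and count how many succeed. A real tool would also measure latency.
async function loadTest(
  handler: () => Promise<void>,
  concurrency: number
): Promise<{ ok: number; failed: number }> {
  const results = await Promise.allSettled(
    Array.from({ length: concurrency }, () => handler())
  );
  const ok = results.filter((r) => r.status === "fulfilled").length;
  return { ok, failed: results.length - ok };
}

// A fake backend (assumption for illustration) that rejects requests
// once it is handling more than 50 at a time.
let inFlight = 0;
const fakeBackend = async (): Promise<void> => {
  inFlight++;
  try {
    if (inFlight > 50) throw new Error("overloaded"); // capacity limit
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated work
  } finally {
    inFlight--;
  }
};
```

Running `loadTest(fakeBackend, 100)` reveals the bottleneck: roughly half the requests fail once the backend's capacity is exceeded, which is exactly the kind of signal that tells a team where to scale.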
What are we doing differently this year?
Last year, our store went down shortly after midnight on Black Friday. To understand what happened, we need to look at how the store is structured. Our store is divided into several parts. What you see on the screen is the so-called frontend, which I work on every day. For this part we use React JS, server-side rendering with Next JS, Styled Components and Apollo for network requests. Then there is the GraphQL middleware. Here we receive all requests from the frontend, forward them to the appropriate backend systems and return responses in the format the frontend requires. The backend systems do the magic behind the scenes: they ensure that all data is stored in the right place and delivered quickly.
Between the backends and the GraphQL middleware, we use a Redis cache to store frequently used data, such as product data. This reduces the load on our databases (MongoDB, SQL Server) and other systems.
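The pattern at work here is often called cache-aside: check the cache first, and only fall back to the database on a miss. A minimal sketch, with a plain `Map` standing in for Redis and a hypothetical product loader (all names are illustrative, not our real code):

```typescript
// Cache-aside sketch: a Map stands in for Redis; in production this
// would be an async Redis client.
const redisStandIn = new Map<string, string>();
const dbReads: string[] = []; // track which reads hit the "database"

// Hypothetical expensive database load we want to avoid repeating.
function dbLoadProduct(id: string): string {
  dbReads.push(id);
  return JSON.stringify({ id, name: `Product ${id}` });
}

function getProduct(id: string): string {
  const cached = redisStandIn.get(`product:${id}`);
  if (cached !== undefined) return cached; // cache hit: database untouched
  const fresh = dbLoadProduct(id);         // cache miss: load once...
  redisStandIn.set(`product:${id}`, fresh); // ...and populate the cache
  return fresh;
}
```

However often `getProduct("42")` is called, the database is read only once; every further request is served from the cache. That is what keeps MongoDB and SQL Server out of the hot path.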
Last year, our GraphQL system was scaled up to 2000 virtual servers running simultaneously, sending tens of thousands of requests per second to the aforementioned Redis cache. At some point the cache could no longer handle the requests and errors occurred. On top of that, an application bug retried these failed requests endlessly. This increased the load further and brought the cache to its knees. As a result, many requests ended up hitting the database directly, which slowed everything down dramatically and eventually led to a brief outage. Of course, we didn't just let that stand. Since the incident, we have revised our caching strategy and now rely on multi-level caching.
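A bounded retry is a common guard against this kind of runaway loop: instead of repeating a failing request forever, you give up after a fixed number of attempts. A minimal sketch (the attempt limit and the flaky function are illustrative assumptions, not our production code):

```typescript
// Bounded retry sketch: try fn at most maxAttempts times, then
// surface the last error instead of looping forever.
function withRetry<T>(fn: () => T, maxAttempts = 3): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // remember the failure, try again (up to the cap)
    }
  }
  throw lastError; // give up after maxAttempts
}

// A hypothetical flaky call that fails twice before succeeding.
let attempts = 0;
const flaky = (): string => {
  attempts++;
  if (attempts < 3) throw new Error("transient failure");
  return "ok";
};
```

With a cap in place, a persistently failing request degrades into a handful of attempts rather than an endless hammering of the cache.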
We now use an in-memory LRU cache as the first cache level, storing the most visited products. This mitigates request peaks hitting our Redis cache. We've also fixed last year's application bug, so a failed request is no longer retried endlessly. This allows us to avoid overloading the Redis cache.
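An LRU ("least recently used") cache keeps a fixed number of entries and evicts the one that hasn't been touched for the longest time, so the hottest products naturally stay in memory. A minimal sketch, exploiting the fact that a JavaScript `Map` preserves insertion order (this is an illustration, not our production implementation):

```typescript
// Minimal LRU cache: the Map's first key is always the least
// recently used entry, because we re-insert on every access.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      // evict the least recently used entry (first in iteration order)
      this.map.delete(this.map.keys().next().value as K);
    }
    this.map.set(key, value);
  }
}
```

In a multi-level setup, a `get` would first consult an instance like this, then Redis, and only then the database, so a traffic spike on a handful of popular products never leaves the process.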
For Black Friday and Cyber Monday 2021, we have also revised and simplified the backend system for so-called “special offers”. We use this functionality not only for the deals during Black Friday week, but also for the daily deals we offer you throughout the year. By simplifying it, we hope to reduce speed issues on the offers side.
What are we doing on Black Friday?
So much for the theory; the practice follows on Black Friday. Shortly before kickoff, teams manually scale up parts of their systems, such as caches or databases, to predefined sizes. On Black Friday, we want to play it safe and intervene manually at these critical points so that enough servers are available to handle the initial surge after midnight. For other parts, our systems automatically create more capacity if bottlenecks appear. So our servers are ready to rumble.
After that, features in the store that are not essential for Black Friday are turned off via “feature flags”. For once, the live feed won’t give you up-to-the-minute information on who’s ordering what from where. We also temporarily switch off product recommendations and community contributions.
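A feature flag is essentially a named switch checked at runtime, so features can be turned off without a deployment. A minimal sketch; the flag names are illustrative, not our real configuration:

```typescript
// Hypothetical feature-flag table: non-essential features are
// switched off for the event, essential ones stay on.
const featureFlags: Record<string, boolean> = {
  liveFeed: false,               // up-to-the-minute order feed
  productRecommendations: false, // off during Black Friday
  communityContributions: false, // off during Black Friday
  checkout: true,                // essential: always on
};

function isEnabled(flag: string): boolean {
  // Unknown flags are treated as off, the safe default under load.
  return featureFlags[flag] ?? false;
}
```

The frontend then simply skips rendering any section whose flag reports `false`, which spares the corresponding backend systems entirely.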
We store promotional offers in our internal systems as so-called advertising campaigns. We define the validity period for all products and determine availability. Then we let the system work for us. It will automatically display the offers in the store and gray them out when they are no longer available.
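The decision the system makes for each offer can be reduced to a small pure function over the campaign's validity window and stock. This is a sketch under assumed field names, not our actual data model:

```typescript
// Hypothetical campaign record: validity period plus availability.
interface Campaign {
  productId: string;
  validFrom: Date;
  validUntil: Date;
  unitsAvailable: number;
}

type OfferState = "active" | "grayed-out" | "hidden";

// Decide how an offer is displayed at a given moment.
function offerState(c: Campaign, now: Date): OfferState {
  if (now < c.validFrom) return "hidden"; // campaign hasn't started yet
  if (now > c.validUntil || c.unitsAvailable <= 0) {
    return "grayed-out"; // expired or sold out
  }
  return "active";
}
```

Because the rule is data-driven, nobody has to flip offers on and off by hand during the night; the store evaluates it on every render.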
One thing we won't do without during the campaign week is our continuous deployment. There is no code freeze. If, as a developer, I push a change to the version control system and a colleague from the relevant shop area approves it, you as a customer will see the change within a few minutes. Here we rely on the automated tests that run before each release, as well as the sense of responsibility of our technical teams.
Developers are on call at midnight on Tuesday night (our first campaign day this year), on Black Friday night and on Cyber Monday night. Colleagues from all areas of the business are present and can respond immediately if something goes wrong.
This search for clues answered many of my questions as a newcomer. Now I'm looking forward to my first Black Friday week. Do you feel the same, or do you want to know more? I won't reveal any deals, but I'm happy to answer technical questions in the comments.