Why do people consider using QoS? The answer is simple: it can save you money.
We live in a world where money dominates, whether we like it or not. Importantly, even if your primary goal is not to save money, if a solution is designed and implemented correctly, it will in fact do just that.
The importance of having QoS enabled on your network usually becomes apparent in one of two scenarios. The first is when an actual problem arises. The second, which is something all network engineers should be striving towards, is during the planning of a QoS strategy.
Planning is the key to building a perfect network
Do not wait for things to happen. By the time a situation you didn’t think of occurs, it’s already too late. Part of planning a comprehensive QoS strategy is to continually discuss and theorise about everything that can go wrong.
Our practice is to read various RFOs (reasons for outage) regarding issues that have lasted for hours, meaning that valuable engineers had their day-to-day operations interrupted until the problem was solved. If you continually evaluate your QoS strategy, engineers’ time is spent on preventing issues rather than reacting to live events, which take up valuable manpower.
Investing in link upgrades is expensive compared to QoS
Some engineers don’t think about saving money; they are purely focused on developing a solution. At Custodian Data Centres, our technical director is an engineer and, as a member of the senior team, he continually has to balance the need to be cost-effective with his desire to perfect the network.
Balancing this equation has not led to a compromise, but instead to a solution that outperforms what we would have achieved by simply spending money on upgrading the links between PoPs.
A simple network configuration
Imagine a simple network consisting of 2 PoPs.
PoP A has some transits available which are much cheaper due to its geographical location.
PoP B has a more convenient location, just outside the M25, with an abundance of power available.
Consequently, we want to take a transit from where it is cheap, to where it is needed. This is achieved by taking two diverse links from one location to another, with each of them capable of taking your entire normal traffic down to PoP B.
The keyword is ‘normal’
If your ‘normal’ traffic is 4-5Gbit/s, you will need two diverse 10G links from PoP A to PoP B to carry it. The reason for having two links is that one of them can fail, and you need a contingency plan for this.
Please note: In PoP A, you would need at least three transits to match your inter-PoP capacity of 20G, because you have to allow for at least one transit to fail or be taken down for maintenance. Normally, you would have four or even more, to allow for maintenance and failures and to minimise disruption.
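The sizing rule in this note can be checked with a quick calculation. What follows is only a minimal sketch in Python, assuming every transit and inter-PoP link is 10Gbps and that one link at a time may be down; the function name `usable_capacity_gbps` is our own illustration, not part of any tool.

```python
def usable_capacity_gbps(link_count, link_speed_gbps, tolerated_failures=1):
    """Capacity that remains once `tolerated_failures` links are down
    (through failure or planned maintenance)."""
    surviving_links = max(link_count - tolerated_failures, 0)
    return surviving_links * link_speed_gbps


# Assumed figures: 10Gbps per link, one link allowed to be down at any time.
inter_pop = usable_capacity_gbps(link_count=2, link_speed_gbps=10)
three_transits = usable_capacity_gbps(link_count=3, link_speed_gbps=10)
four_transits = usable_capacity_gbps(link_count=4, link_speed_gbps=10)

print("Inter-PoP capacity with one link down:", inter_pop, "Gbps")
print("Three transits with one down:", three_transits, "Gbps")
print("Four transits with one down:", four_transits, "Gbps")
```

Under these assumptions, three transits still leave 20Gbps when one is down, which is why three is the minimum and four gives you room for maintenance on top of a failure.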
Attack!
Unfortunately, we live in a world of antagonists who thrive on attacking you and/or your client. They may have a specific reason, or they could just be having fun. Attacks are one of the reasons you will no longer have your normal level of traffic. For the sake of QoS, the attacks we are concerned with are those that saturate your transit uplinks or inter-site links.
Immediate response options
First of all, an attacker can succeed or fail. For the purpose of this illustration, we will say that success is when they have managed to saturate the transit uplink.
In this situation, we would have to blackhole the target IP to save everyone else from being affected by packet loss. Please bear in mind that we will not discuss DDoS mitigation options, as this is outside the scope of QoS.
If the attack fails, it can still cause issues.
Why? Well, if the attack managed to create inbound traffic of 7-8Gbps on each of the 4x10Gbps transit providers, then despite the actual transit ports not being saturated, we still have 28-32Gbps of traffic inbound to PoP A. This simply won’t fit down a 10Gbps link to PoP B (2x10Gbps, sized so that one link can fail).
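The arithmetic here is easy to reproduce. The short Python sketch below uses only the figures from this example; the constant and function names are hypothetical and chosen for illustration.

```python
TRANSIT_COUNT = 4            # 4x10Gbps transit providers in PoP A
INTER_POP_USABLE_GBPS = 10   # 2x10Gbps, sized to survive the loss of one link


def inbound_range_gbps(per_transit_low, per_transit_high, transits=TRANSIT_COUNT):
    """Aggregate inbound traffic (low, high) across all transits, in Gbps."""
    return per_transit_low * transits, per_transit_high * transits


# 7-8Gbps on each transit: no single 10Gbps port is saturated...
low, high = inbound_range_gbps(7, 8)
# ...but the total is far more than the inter-PoP link can carry.
excess_low = low - INTER_POP_USABLE_GBPS
excess_high = high - INTER_POP_USABLE_GBPS

print(f"Inbound to PoP A: {low}-{high}Gbps")
print(f"Excess over inter-PoP capacity: {excess_low}-{excess_high}Gbps")
```

In other words, somewhere between 18 and 22Gbps has to be dropped before it reaches PoP B, and the only question is which traffic that will be.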
So, what are the options?
Option 1: Upgrade links
A lazy approach would be to upgrade your links between PoPs from 10Gbps to 40Gbps. Engineers don’t tend to focus on money, however. When management calculates the cost of upgrades, the answer is often: no.
Simultaneously, they need to solve the problem of one client being affected by an attack on another client. Unfortunately, a lot of colocation facilities won’t do this. They’ll simply advise you that you or they are under attack, then proceed to blackhole the target IP. Still, even if an attacker fails to saturate the transit uplinks, clients will be disrupted unless inter-PoP links are upgraded to 40Gbps or QoS is implemented.
Let’s say your company has lots of money, and you upgraded your inter-PoP links to 40Gbps. Now, when an attacker fails to saturate your transit uplinks, you’ll be able to take all 28-32Gbps of traffic down to PoP B.
So what now? The client under attack most probably has an SLA of 200Mbps on a 1Gbps port, and you will deliver all of the 20+Gbps attack to them. Was it worth taking it down to PoP B just to saturate your client’s port? Regardless, you have solved your initial problem as other clients are unaffected by this attack.
Option 2: Implement QoS
First of all, you need to create a traffic class per client. It’s quite easy for us here at Custodian, as every client and their IP ranges are in the same system that creates access lists and class maps for routers. This means we can simply copy and paste the configuration. Building such a system is up to your engineers, and it is something we really recommend doing.
Next, you must agree on the traffic marking. Every frame or packet has its own encapsulation and thus its own QoS tag. In an IP/MPLS network there are three main ones: CoS, DSCP and EXP.
Then, you must plan your maps, queues and thresholds to achieve the goal of dropping the least important traffic if it doesn’t fit somewhere. Additionally, if you treat your markings uniformly across your entire network, it is easy to both implement and troubleshoot, if an issue arises. After this is complete, you’ll only need to assign each frame/packet a QoS label to finish off.
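To illustrate what treating markings uniformly can look like, here is a minimal Python sketch of a single marking table shared by every device. The class names and the DSCP values chosen for them (CS6, CS2, CS1) are purely our assumptions for this example, not a recommendation and not our production scheme. The sketch derives the 3-bit CoS and EXP values from the top three bits of the 6-bit DSCP, which is one common convention; your platform may map them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marking:
    dscp: int  # 6-bit value carried in the IP header
    cos: int   # 3-bit 802.1p value carried in the VLAN tag
    exp: int   # 3-bit value carried in the MPLS label


def marking_from_dscp(dscp):
    """Derive CoS and EXP from the top three bits of the DSCP (assumed convention)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    top_three_bits = dscp >> 3
    return Marking(dscp=dscp, cos=top_three_bits, exp=top_three_bits)


# Hypothetical network-wide table: one entry per priority level.
MARKINGS = {
    "control": marking_from_dscp(48),  # CS6, assumed for control traffic
    "sla": marking_from_dscp(16),      # CS2, assumed for in-SLA traffic
    "burst": marking_from_dscp(8),     # CS1, assumed for above-SLA traffic
}

for name, marking in MARKINGS.items():
    print(name, marking)
```

Keeping one table like this as the source for every generated configuration means a given priority always carries the same values, whichever encapsulation the packet is in at the time.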
Example
A customer has an SLA of 200Mbps on 1Gbps port. In other words, you allow them to burst up to 1Gbps while guaranteeing to deliver the first 200Mbps.
In this case, you will police (drop), aggregately or not, everything above 1Gbps in PoP A, then mark the first 200Mbps as high priority and the remaining 800Mbps as low priority.
When these frames/packets are travelling down your links to PoP B and a link saturates, you will only drop from those 800Mbps of low priority traffic, which is out of SLA anyway.
Here’s what your QoS should do:
1. Drop anything above what you can deliver to the client (why bother taking 20+Gbps of traffic down to where you will drop it anyway?)
2. Mark the remaining traffic as high priority up to the SLA and low priority above it. In the case of saturation, you won’t breach the SLA and won’t affect other clients.
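The two steps above can be sketched as simple rate arithmetic. This Python example is only an illustration of the outcome for the 200Mbps-on-1Gbps client; it is not a token-bucket policer and not router configuration, and the function name `police_and_mark` is hypothetical.

```python
def police_and_mark(offered_mbps, sla_mbps=200, port_mbps=1000):
    """Split a client's offered rate into (high, low, dropped), all in Mbps.

    high    -> within the SLA, marked high priority
    low     -> above the SLA but within the port speed, marked low priority
    dropped -> above the port speed, policed at the edge in PoP A
    """
    admitted = min(offered_mbps, port_mbps)
    high = min(admitted, sla_mbps)
    low = admitted - high
    dropped = offered_mbps - admitted
    return high, low, dropped


# A 20Gbps attack aimed at the client.
print(police_and_mark(20_000))  # (200, 800, 19000)
# The same client on a quiet day.
print(police_and_mark(150))     # (150, 0, 0)
```

Of the 20Gbps attack, 19Gbps never leaves PoP A, and only 1Gbps travels towards PoP B, with 800Mbps of it marked as first to go if a link saturates.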
After implementing this, you will need enough capacity to carry your SLA traffic to PoP B. This is probably your normal traffic of 5-6Gbps. Anything above it is not in an SLA, so whether it gets delivered is your decision.
Interestingly, as you still have lots of spare capacity here, even client bursts will be satisfied and a packet will almost ‘never’ be dropped.
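To see how the spare capacity plays out, here is a rough Python sketch of what one inter-PoP link carries once traffic is marked. It assumes high priority traffic is always served first and low priority traffic only gets what is left, which is one possible policy among several. The 5.5Gbps and 6Gbps figures are made up for the example, and the function name is hypothetical.

```python
def carry_over_link(high_mbps, low_mbps, link_mbps):
    """Work out what a link carries when high priority traffic is served first.

    All values are in Mbps. Low priority traffic only uses the capacity
    that high priority traffic leaves spare.
    """
    high_carried = min(high_mbps, link_mbps)
    spare = link_mbps - high_carried
    low_carried = min(low_mbps, spare)
    return {
        "high_carried": high_carried,
        "high_dropped": high_mbps - high_carried,
        "low_carried": low_carried,
        "low_dropped": low_mbps - low_carried,
    }


# Assumed figures: 5.5Gbps in SLA, 6Gbps of bursts, one surviving 10Gbps link.
result = carry_over_link(high_mbps=5_500, low_mbps=6_000, link_mbps=10_000)
print(result)
```

With these assumed numbers, all in-SLA traffic is delivered, 4.5Gbps of bursts still get through and only 1.5Gbps of out-of-SLA traffic is dropped.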
Caveat
Unfortunately, it’s impossible to summarise everything we have done in the past and continue to do, but it’s worth mentioning that you should always be vigilant and conduct plenty of testing before putting your QoS into practice. For example, you need to allocate buffers to queues, as well as think about de-queuing rates and having your control traffic go into an even higher priority queue. Never put traffic of the same class into different queues, or you will get packet re-ordering; use thresholds instead. You must also always ensure you filter all control-plane traffic in order to protect switch/router CPUs.
There are also always lots of bugs in routers/switches and their documentation. Even if you configure a feature as documented, that doesn’t mean it will work that way, because of either a bug in the equipment or an error in the documentation. That’s why you should always test every feature on every class of device before using it, to ensure it is working as it should.
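The point about thresholds is easier to see in a small model. The Python sketch below is a toy, not how any particular switch implements its queues: it assumes a single FIFO queue in which low priority packets are refused once the queue depth reaches a threshold, while high priority packets may use the whole queue. The class name `ThresholdQueue` and the sizes are hypothetical.

```python
from collections import deque


class ThresholdQueue:
    """One FIFO queue with a lower drop threshold for low priority packets."""

    def __init__(self, capacity, low_priority_threshold):
        self.capacity = capacity
        self.low_priority_threshold = low_priority_threshold
        self.packets = deque()

    def enqueue(self, seq, priority):
        """Accept the packet if the queue depth is under the limit for its priority."""
        limit = self.capacity if priority == "high" else self.low_priority_threshold
        if len(self.packets) >= limit:
            return False  # dropped at the threshold
        self.packets.append((seq, priority))
        return True

    def dequeue(self):
        """Send the oldest packet, or return None if the queue is empty."""
        return self.packets.popleft() if self.packets else None


queue = ThresholdQueue(capacity=4, low_priority_threshold=2)
arrivals = [(1, "high"), (2, "low"), (3, "low"), (4, "high"), (5, "low"), (6, "high")]
dropped = [seq for seq, priority in arrivals if not queue.enqueue(seq, priority)]

print("Dropped:", dropped)
print("Sent in order:", [seq for seq, _ in queue.packets])
```

Because everything that is accepted sits in one queue, the surviving packets leave in the order they arrived. Put the high and low priority packets of the same class into two separately served queues instead, and a later packet can overtake an earlier one.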
In essence
As demonstrated, QoS has saved money by avoiding the upgrade of links from 10Gbps to 40Gbps and has also made the network scale, which is incredibly important for large networks and networks with a lot of clients. So, the perfect network comes at the price of heavy engineering and testing… lots and lots of testing! But, as we have said, it is definitely worth doing, as it scales well, simplifies troubleshooting and saves you lots of money.
Pavelas Sokolovas is head of infrastructure at CustodianDC.