Layer-4 Load Balancer for Flow Size Prediction with TCP/UDP Separation Using P4

Nowadays, datacenters are becoming more complex and are handling many more user requests. Custom protocols are in increasing demand, and an advanced load balancer that distributes requests among servers is essential to serve users quickly and efficiently. P4 introduced a new way to manipulate all packet headers. Therefore, by making use of P4's ability to decapsulate the transport-layer header, a new load-balancing algorithm is proposed. The algorithm has three main parts. First, a TCP/UDP separation is used to separate flows based on the network-layer information about which transport-layer protocol is in use. Second, a flow size prediction technique is adopted, which relies on the service port number of the transport layer. Lastly, a probing system is employed to detect and recover from link and server failures. The proposed load balancer improves both the resource usage and the packet processing response time of the datacenter. Also, our load balancer improves link failure detection by introducing a custom probing protocol.


The SP is generated by the physical servers. It is a single frame whose payload is encapsulated by the probing protocol. An SP is generated every 10 ms and sent to the edge switch connected to its server. In the SP, the one-byte header of the probing protocol carries the CPU utilization of the physical server.
In the P4 code running on the switches, when an edge switch receives a frame, it checks the EtherType. If the EtherType is 0x011, the frame is identified as an SP frame and is passed to the corresponding parser written in P4.
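The SP handling described above can be sketched in Python (an illustrative model only; the frame layout and function names are assumptions, while the actual parser is written in P4):

```python
import struct

PROBE_ETHERTYPE = 0x011  # EtherType identifying probe frames (from the text)

def parse_frame(frame: bytes):
    """Parse an Ethernet frame; if it is an SP, return the server CPU utilization."""
    dst, src = frame[0:6], frame[6:12]
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != PROBE_ETHERTYPE:
        return None  # not a probe frame; handled by the normal pipeline
    cpu_utilization = frame[14]  # one-byte probing-protocol header
    return cpu_utilization

# Example: an SP advertising 42% CPU utilization
frame = b"\xff" * 6 + b"\xaa" * 6 + struct.pack("!H", 0x011) + bytes([42])
assert parse_frame(frame) == 42
```

Note that such a frame is 15 bytes long, which matches the probe size given later in the comparative study.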

Switches Probe (WP)
When an edge switch receives an SP, it generates the second type of probe, i.e. the WP. Edge switches do not generate any WP if they have no connection to an active physical server. Aggregation and core switches never generate a WP; they only replicate the received WPs. Therefore, not receiving a WP from a switch in the lower tier means that there is no route to an active physical server through that switch.
To prevent probe loops within the DC, each pod is isolated from the others. Besides, WP probes can only be passed from the lower to the upper tier of switches, i.e. from edge switches to aggregation switches, or from aggregation switches to core switches. Figure-3 clarifies the probing system (considering the existence of only two servers).

The Probes Mechanism
In the proposed DC, every edge switch has a one-bit register for each connected server, and marks the server as functional by setting the corresponding register to 1. Similarly, the switches in the other tiers each have a one-bit register for every connected switch in the lower tier. The time threshold for a switch to wait for an SP or WP is 12 ms; after that, the server or switch is marked as failed.
The probes mechanism can be clarified by the following procedure:
1. A physical server sends an SP to the connected edge switch every 10 ms.
2. When the edge switch receives an SP from a server, the corresponding register is set to 1.
3. Whenever an edge switch receives an SP, it creates a WP and sends it to the connected aggregation switches.
4. In turn, when an aggregation switch gets a WP from an edge switch, it replicates this WP to the connected core switches. It also sets the P4 register of the corresponding edge switch to 1.
5. Consequently, the core switches replicate the WPs produced by the aggregation switches to the ToR switch and set the P4 register of the sending aggregation switch to 1. At this point, the ToR switch knows the active switches that have routes to active physical servers.
6. In case an edge switch does not receive an SP from a connected physical server within 12 ms, the server is marked as failed by resetting the register to 0. Subsequently, no request will be forwarded to it.
7. If an edge switch does not receive any SP from its two connected physical servers, it will not generate a WP for the connected aggregation switches. Accordingly, these aggregation switches will replicate nothing to the core switches. If a core switch receives no WP from any of the connected pods, it will not send a WP to the ToR switch, and that core switch will be marked as failed. In general, whenever a switch does not receive a WP from a connected switch in the lower tier, the corresponding register is reset to 0.
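The register updates and the 12 ms timeout in this procedure can be sketched as follows (a simplified Python model; the class and method names are illustrative assumptions, not the P4 implementation):

```python
import time

SP_INTERVAL = 0.010   # servers send an SP every 10 ms
TIMEOUT = 0.012       # a switch waits at most 12 ms for an SP/WP

class EdgeSwitch:
    def __init__(self, servers):
        # one-bit register per connected server: 1 = alive, 0 = failed
        self.alive = {s: 1 for s in servers}
        self.last_seen = {s: time.monotonic() for s in servers}

    def on_sp(self, server):
        """Steps 2-3: mark the server alive and emit a WP to aggregation switches."""
        self.alive[server] = 1
        self.last_seen[server] = time.monotonic()
        return "WP" if any(self.alive.values()) else None

    def expire(self, now=None):
        """Step 6: reset the register of any server silent for more than 12 ms."""
        now = time.monotonic() if now is None else now
        for s, t in self.last_seen.items():
            if now - t > TIMEOUT:
                self.alive[s] = 0

edge = EdgeSwitch(["Server1", "Server2"])
edge.on_sp("Server1")
edge.expire(now=time.monotonic() + 0.020)  # 20 ms later, no new SPs
assert edge.alive == {"Server1": 0, "Server2": 0}
```

The aggregation and core tiers would follow the same pattern, tracking WPs instead of SPs.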

Flow Size Prediction
After making sure that the switch chosen by the TCP/UDP separation is active, the prediction method starts to check whether the switch is the best choice or whether the RR counter should be incremented to the next switch. In networking terms, flows are classified into two categories depending on the size of the data transfer: elephant flows and mice flows [12].
An elephant flow is a large data transfer that occupies a huge amount of bandwidth and requires a relatively long time to finish. Examples of elephant flows are video streaming and software updates. On the other hand, a mice flow is a small flow that requires little time to complete. Examples of mice flows are web browsing and internet searching [13].
Each service on the internet is basically an application-layer protocol with an identifier called a service port number (SPN). The headers of the layer-4 protocols, i.e. TCP and UDP, have a field for the SPN. As a result, any layer-4 device within the network can recognize a flow's layer-7 protocol.
Since each protocol has its own SPN, checking the SPN in the TCP or UDP header identifies the layer-7 service (or protocol) of a flow. As a result, the flow size can be predicted from the SPN.
In this paper, the DC is supposed to provide a mixture of both elephant and mice flows. More specifically, each physical server hosts two virtual servers, with the port numbers 80 and 554. Respectively, these two ports ensure that the DC provides two services, one for web paging and the other for video streaming. Port 80 uses TCP, while 554 uses both TCP and UDP. The supported port numbers can be changed to any port numbers and extended to more port numbers in code.
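The SPN-based prediction amounts to a small lookup; a minimal sketch for the two services above (the mapping can be extended with more ports, as the text notes):

```python
# Assumed mapping for the two services hosted in the paper's DC.
ELEPHANT_PORTS = {554}   # video streaming -> elephant flows
MICE_PORTS = {80}        # web paging -> mice flows

def predict_flow_size(service_port: int) -> str:
    """Predict a flow's size class from the layer-4 service port number."""
    if service_port in ELEPHANT_PORTS:
        return "elephant"
    if service_port in MICE_PORTS:
        return "mice"
    return "unknown"     # unsupported port: no prediction is made

assert predict_flow_size(554) == "elephant"
assert predict_flow_size(80) == "mice"
```

In the P4 implementation, such a lookup would naturally be a match-action table keyed on the TCP/UDP port field.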

Routing Mechanism
When a host requests a mice flow service and sends the request to the ToR switch, the ToR switch forwards it directly to the next core switch using RR scheduling. In contrast, if the issued request is for an elephant flow service, the ToR switch checks the core switch whose turn it is in the RR. If that core switch is busy with a mice flow or free of any load, the request is forwarded to it. But if the core switch is loaded with an elephant flow, the ToR switch checks whether the flow is marked as timed out. If the flow is timed out, the request is forwarded to that core switch; otherwise, the next core switch on the RR goes through the same procedure. When all the core switches are busy with elephant flows, the request is forwarded to the next core switch on the RR.
The core and aggregation switches, as well as the ToR switch, do the same checking before routing any request to a switch in the next lower tier of switches.
When an edge switch receives a request, it forwards it to the server with the lowest CPU usage; the servers' CPU usage is advertised from the servers to the edge switches. Whenever a request is received by a server, the server responds on the same path the request was routed over. Figure-4 shows a flowchart of the complete routing system.
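The elephant-flow placement at the ToR switch can be sketched as a round-robin with an elephant check (a simplified model; the function and the state encoding are illustrative assumptions):

```python
def choose_core_switch(flow_class, rr, switches):
    """
    Pick the next core switch for a request at the ToR switch.
    `switches` maps a switch name to its state: "free", "mice", or "elephant"
    (an active, non-timed-out elephant flow). `rr` is the current round-robin
    index; returns (chosen switch, new rr index).
    """
    names = list(switches)
    if flow_class == "mice":
        # mice flows go straight to the next switch on the RR
        return names[rr], (rr + 1) % len(names)
    # elephant flows: skip switches loaded with an active elephant flow
    for i in range(len(names)):
        idx = (rr + i) % len(names)
        if switches[names[idx]] != "elephant":
            return names[idx], (idx + 1) % len(names)
    # all cores busy with elephants: fall back to plain RR
    return names[rr], (rr + 1) % len(names)

state = {"S17": "elephant", "S18": "mice", "S19": "free", "S20": "elephant"}
chosen, rr = choose_core_switch("elephant", 0, state)
assert chosen == "S18"   # S17 is skipped: it holds an active elephant flow
```

A timed-out elephant flow would simply be recorded as "free" in this model, since the text treats such a switch as eligible.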

3.1 Flow Completion/Timeout
Before forwarding any host request, the ToR, core, and aggregation switches must check the current switch on the RR in the next tier of switches. It must be ensured that the chosen switch either carries only a mice flow (or is completely free) or holds a timed-out elephant flow.

TCP Flow Timeout
For a TCP elephant flow to be tagged as a timed-out flow, one of the following conditions must be met:
- Flow completion
- Flow interruption
If a switch detects that a TCP elephant flow is completed or interrupted, the switch considers the flow timed out. As a result, a new host request can be forwarded to this switch.
Detection of TCP Flow Completion: The TCP is a connection-oriented protocol. An established connection between any two peers is maintained until one of the peers terminates the connection. The termination procedure begins by setting the FIN bit in the TCP header by a peer, and the other peer responds by setting the ACK bit. Then, the same procedure is repeated by the other peer. After that, the TCP connection is closed.
Accordingly, a TCP flow completion can be detected by checking the FIN bit. Whenever the FIN bit is found to be 1, the connection is marked as a completed flow, and both the CITO and the TFTO registers are set to 1.
Note that, in this work, the switch does not wait for the acknowledgment (ACK) of the FIN to be received. The instant a FIN of a TCP flow is detected, the corresponding completion/interruption timeout register is set to 1, because the data transfer is done and there is no need to wait for the full termination procedure to finish. The switches rely on the completion/interruption timeout register to make forwarding decisions without waiting for the deletion time, which could cause faulty routing.

Detection of TCP Flow Interruption: For various reasons, a flow may be interrupted before completion. In such a case, the FIN bit never arrives. Therefore, the switch where the interruption happened would always appear to be loaded with an active elephant flow, and no host requests would be routed to it.
The interruption situation is solved by setting a time threshold on the interframe gap between two received frames of the same flow at the server side. The threshold is chosen to be 115 µs. Thus, when a switch receives a frame that belongs to a flow, the switch waits for 115 µs. If the threshold passes without receiving the next frame, the flow is timed out, and the corresponding completion/interruption timeout register is set to 1.
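Both timeout conditions, FIN-based completion and the 115 µs interframe-gap interruption, can be sketched together (the register name follows the text; the class structure is an illustrative assumption):

```python
INTERFRAME_GAP = 115e-6  # 115 us threshold between two frames of the same flow

class FlowState:
    def __init__(self):
        self.last_frame_time = None
        self.cito = 0  # completion/interruption timeout register

    def on_frame(self, now, fin=False):
        """Mark the flow timed out as soon as a FIN is seen (no wait for the ACK)."""
        self.last_frame_time = now
        if fin:
            self.cito = 1

    def check_gap(self, now):
        """Mark the flow timed out if the interframe gap exceeds 115 us."""
        if self.last_frame_time is not None and now - self.last_frame_time > INTERFRAME_GAP:
            self.cito = 1

flow = FlowState()
flow.on_frame(now=0.0)
flow.check_gap(now=50e-6)      # 50 us gap: flow still active
assert flow.cito == 0
flow.check_gap(now=200e-6)     # 200 us gap: interrupted -> timeout
assert flow.cito == 1
```

The same gap check also covers the UDP case described below, where completion and interruption are treated as the same event.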

UDP Flow Timeout
Since UDP is not a reliable protocol and has no header field carrying connection-termination information, the end of UDP traffic cannot be detected directly.
For that reason, a simple method is used to announce that a UDP flow is completed: the interruption and the completion are treated as the same event. That is, whenever the 115 µs threshold elapses, the UDP flow is announced as completed or interrupted. Consequently, the corresponding completion/interruption timeout register is set to 1, and the UDP flow is declared timed out.

4 Results and Discussion
To implement the proposed LB, the P4 language is used [14] with its switch simulator, i.e. the behavioral model version 2 (Bmv2). The DC FatTree network is built using the network emulator Mininet [15]. Compiling a P4 program causes the Bmv2 simulator to be executed in the Mininet switches. Afterward, all the switches within the network created with Mininet become P4-based switches.
For FatTree network construction, a factor k must be known, representing the number of ports on each switch within the FatTree. The number of core switches equals (k/2)^2. The network also has k pods. Each pod has (k/2) aggregation switches and (k/2) edge switches. Within a pod, each aggregation switch is connected to (k/2) core switches and (k/2) edge switches. The edge switches within a pod are connected to the (k/2) aggregation switches and to (k/2) servers [16]. Figure-5 shows a FatTree topology with k=4, which is the network of this work. The overall performance of the LB is limited by the aforementioned specifications of both the physical and the virtual machines. Those specifications achieved a bandwidth of about 1 Mbps for all the links between the switches.
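These component counts follow directly from k; a quick check for the k=4 topology of this work:

```python
def fattree_sizes(k: int) -> dict:
    """Compute FatTree component counts from the switch port count k."""
    return {
        "core_switches": (k // 2) ** 2,      # (k/2)^2 core switches
        "pods": k,                            # k pods
        "aggregation_per_pod": k // 2,
        "edge_per_pod": k // 2,
        "servers_per_edge": k // 2,
        "total_servers": k * (k // 2) * (k // 2),  # k pods x (k/2) edges x (k/2) servers
    }

sizes = fattree_sizes(4)
assert sizes["core_switches"] == 4
assert sizes["total_servers"] == 16
```

For k=4 this gives 4 core switches, 4 pods with 2 aggregation and 2 edge switches each, and 16 servers in total, i.e. the standard k^3/4 server count of a FatTree.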
The probes were tested for both encapsulation and decapsulation times to show the efficiency of the proposed probing protocol. Consequently, some performance metrics were measured for this work. At first, the scalability, which shows the benefits of the probes system was measured. Then, the CPU utilization of the switches that are located within the DC was estimated. Afterward, the response time for packet processing was calculated. Furthermore, two more tests were conducted, one for throughput and the other for latency. Lastly, the Request Handling Time (RHT) was calculated to evaluate the benefits of the TCP/UDP separation technique.

4.1 Probes Evaluation
As described before, the failure detection system presented by the SP and WP uses a one-byte payload encapsulated by only the Ethernet header (the layer-2 header). Thus, the time of both the encapsulation and decapsulation processes is less than for a frame with full header encapsulation. Figure-6 demonstrates the enhancement of the probes in terms of reducing the processing time.

Scalability
For a DC, scalability is the ability of the DC to keep working when a device (network device or server) is scaled in or out. Scaling out may imply the failure of a device or of a link between devices [1].
In the scenario of DC extension with additional switches or servers, and after the proper configuration, flows can be directed to the new device, because the probes advertise its presence to the connected switches in the upper tier. Table-3 shows a scenario of several elephant flow requests (denoted Rn) arriving at the ToR switch, together with the routes over which they must be forwarded based on the proposed algorithm (referring to Figure-3).

Rn   Protocol   Route
R1   TCP        S17-S9-S2-Server3
R2   UDP        S20-S14-S6-Server12
R3   TCP        S18-S9-S2-Server4

Now suppose the link between S17 and S9 goes down, as well as the link between S6 and S14. In addition, a new server (called Serverx) is added to the edge switch S2 between Server3 and Server4. As a result, the new routes are as shown in Table-4.

Table 4-Chosen Paths After the Link Failures and the Addition of Serverx
Rn   Change in DC            Route
R1   S17-S9 link is down     S18-S9-S2-Server3
R2   S6-S14 link is down     S18-S13-S6-Server12
R3   Serverx is added        S19-S10-S2-Serverx

Without the probing system, there would be no way to fail over the failed links, which would cause a connection interruption, and the newly added server would not be noticed.

CPU Usage
As aforementioned, Mininet is used to build the DC network. Since Mininet is an emulator, each device within the DC utilizes a fraction of the CPU allocated to the VM.
The CPU usage of every switch can be recorded by monitoring the overall CPU utilization of the VM. The monitoring process showed that the average CPU usage of each switch is about 14% in the no-load case.
Consider the proposed LB working with the TCP/UDP separation only, without the FSP technique, and the sequence of elephant requests shown in Table-5 arriving at the ToR switch. R1 completes just before the arrival of R5. Since the LB works without FSP, R5 is directed to S18, causing it to be busy with two elephant flows (R3 and R5). Proceeding with such a scenario, S18 would become busy with more than two elephant flows.

Rn   Protocol   Switch
R1   TCP        S17
R2   UDP        S20
R3   TCP        S18
R4   UDP        S19
R5   UDP        S18/S17

Figure-7 highlights the issue of S18 processing more than one elephant flow and how that affects the CPU usage. The figure also shows how the FSP technique reduces the CPU usage by keeping the number of flows on S18 at one for as long as possible, or at least at the minimum possible number.

Figure 7-S18 CPU usage with and without FSP
With the absence of prediction, and according to the TCP/UDP separation, the LB decision for R5 is to forward it to S18, which is already busy with R3. On the other hand, FSP causes the ToR switch to check S18, and the check indicates that S18 is busy with an elephant flow. The RR counter is then incremented by one, and the LB chooses S17 as the next hop, because S17 is no longer busy with an elephant flow.

Packet Processing Response Time (PPRT)
When a packet is received on any of the P4 switches, the proposed LB algorithm will make the right decision in choosing the proper output port. The time from the moment that a P4 switch receives a packet to the time that it forwards this packet to an output port can be called Packet Processing Response Time. The average PPRT of each switch within the DC, when it is busy with only one flow, is 0.018µs.
The PPRT is proportional to CPU usage: an increase in CPU usage causes the PPRT to increase. The same scenario shown in subsection 4.3 is considered here, in which the CPU utilization grows with the load on S18 when the switch operates without FSP. In short, a busier switch means a higher PPRT. Figure-8 shows the response time of S18 when it operates with and without FSP. From Figure-8, it is clear that when S18 handles only one elephant flow, the PPRT is much better than when it handles multiple flows. The PPRT affects not only the elephant flows but also the mice flows. The TCP/UDP separation does not guarantee the minimum number of flows on a switch, while FSP does. As a result, the proposed LB attempts to keep a P4 switch busy with the minimum number of flows, which leads to lower CPU usage and, consequently, lower PPRT.

Throughput
To measure the efficiency of the FSP technique of the proposed LB, multiple requests for a 10 MB video file are sent to the DC from many different hosts connected to the ToR switch. The switches of the DC choose the path along which each request is forwarded, and a server responds to a request on the same chosen path. The throughput is then measured for the response flow of the first request, once when the path is occupied by that flow alone and again when the path is shared by multiple flows. The throughput can be measured by Equation-1.
Equation 1: Throughput = Transferred data size / Transfer time

Figure-9 shows the throughput for the first request when the LB is working with and without FSP. From the figure, it can be concluded that the FSP technique enhances the throughput of the flow, because FSP utilizes the available bandwidth by distributing the load and placing the fewest possible elephant flows on each path.

Latency
In networking, latency is the time required for data or a request to travel from the source to the destination and back to the source [17]. It is measured here using an edited version of the Ping command-line tool to evaluate the round-trip time, which represents the latency. The edited version ensures that Ping can work with the service port numbers 80 and 554 over both TCP and UDP.
The latency is measured by using Ping to send a stream of 64-byte packets from different hosts to the servers in two scenarios: the first integrates the LB with FSP and the second runs without FSP. Figure-10 shows the latency results recorded from the test.

Effects of TCP/UDP separation on request handling time
To demonstrate the benefits of the TCP/UDP separation in reducing the time required to forward a request, the RHT metric must be calculated. RHT is the time required for a switch to select the next switch in a route for an incoming request; lower RHT values represent better performance, since a low RHT means less time spent forwarding a request to the next switch. The RHT value depends on the time required to check the switch whose RR turn it is, i.e. whether it has an elephant flow or not, and, if so, the time to advance the RR counter to the next switch. In a scenario such as that shown in Table-6, when the ToR switch receives the elephant flow requests, it must check each core switch before placing a request. By the time R5 is received, R4 has already finished. As a result, and according to the FSP system, R5 should be forwarded to S20. With or without the TCP/UDP separation, S20 is chosen as the next hop, but the separation saves time in choosing it.

Rn   Protocol   Switch
R1   TCP        S17
R2   TCP        S18
R3   TCP        S19
R4   TCP        S20
R5   UDP        S20
To calculate RHT, Equation-2 is used, where S is the time when a received request is placed at an output port of a switch after choosing the next hop, R is the time when the request is received at an input port of the switch, and 0.015 ms is the average encapsulation/decapsulation time.

Equation 2: RHT = (S − R) − 0.015 ms
As shown in Figure-11, the RHT value for R5 without the TCP/UDP separation is higher than with it. This can be explained as follows. When R5 is received at an input port of the ToR switch, the ToR switch checks the core switch whose turn it is in the RR scheduling, which is S17. Since S17 is busy with R1, the ToR switch jumps to S18 and checks it. S18 and S19 are both busy, with R2 and R3 respectively, so S20 is finally checked and chosen as the next switch, because R4 completed just before R5 was received. In contrast, with the separation technique, and since R5 is UDP, S20 is chosen directly. In this case, the consumed time is just that of checking S20, with no time wasted incrementing the RR counter and checking the other switches.
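The saving can be illustrated by counting switch checks in this scenario (a simplified model; the function and its state encoding are illustrative assumptions, not part of the P4 code):

```python
def checks_until_placed(rr_start, core_state, order):
    """Count how many core switches are checked before a request is placed
    (round-robin with active-elephant skip)."""
    for i in range(len(order)):
        idx = (rr_start + i) % len(order)
        if core_state[order[idx]] != "elephant":
            return order[idx], i + 1
    # all switches busy with active elephants: fall back to the RR switch
    return order[rr_start], len(order)

# Scenario of Table-6: R1-R3 still active on S17-S19, R4 on S20 just finished.
state = {"S17": "elephant", "S18": "elephant", "S19": "elephant", "S20": "free"}
# Without separation: checking starts at S17 -> S17, S18, S19, S20 = 4 checks.
assert checks_until_placed(0, state, ["S17", "S18", "S19", "S20"]) == ("S20", 4)
# With separation: the UDP request R5 goes straight to the UDP-side group
# (here only S20) -> a single check.
assert checks_until_placed(0, state, ["S20"]) == ("S20", 1)
```

Each avoided check also avoids one RR-counter increment, which is where the RHT difference in Figure-11 comes from.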

Figure 11-RHT for ToR Switch with and without TCP/UDP Separation
Despite the benefits of the FSP technique, it consumes time in checking a switch and then jumping to the next. By combining the TCP/UDP separation with FSP, however, the RHT can be reduced.
The TCP/UDP separation technique may give L4FSA a benefit in reducing the RHT without adding any extra load, because the time required for the separation is trivial.

Comparative Study
To show the improvements and enhancements of the proposed LB, it was compared to the aforementioned related works on several important factors:

A. Link/Server Failure: Hula uses probes for link failure detection, but they cannot detect server failure. The probes of this proposal can detect both failures. Besides, the Hula probes are 64 bytes in size, while the proposed probes are only 15 bytes. In the Efficient Multipath work, failures cannot be detected at all.

B. Flow/Flowlet: All the discussed related works are flowlet-based. Despite the benefit of reducing connection interruption, flowlets add the overhead of splitting a flow and choosing the right slicing time to prevent flowlet reordering at the client side. Thus, this proposal uses a flow-based LB, relying on the probes for fast link failover.

C. Up/Down Stream Switch Notation: All the related works assume a DC with up and down switch notation. Only Hula adopted the separated notation to prevent probe forever-loops. The proposed work, like the other related works, uses the combined notation, which is more realistic and popular.

D. Supported Topologies: This work is designed and tested in a FatTree DC; however, any other DC topology can be adopted. The related works can run on any topology, except for W-ECMP, which requires a topology with an equal number of hops.

E. Next Path Holding: Except for the Hula and Efficient Multipath proposals, in which a switch holds only one path, the other related works, as well as this work, use switches that can provide all the available next paths.

F. Congestion Information Technique: For Hula, the congestion information is carried by the probes, while for the other works it is carried over the regular traffic. Due to the full-layer header encapsulation of the probes in Hula, congestion may be noticed late. On the other hand, delivering the congestion information over regular traffic means unnecessary overhead from adding extra encapsulation headers. In contrast, this proposal records the traffic information in each switch as requests arrive, then uses this information to determine the congested links.
An accurate comparison between this proposal and the related works in terms of evaluation metrics is not possible, for two reasons: either the authors of those works used a simulator to build the network, while we emulate the network to achieve realistic performance, or we could not match the very high specifications of the VMs used in their research. In both cases, their reported results are not directly comparable.

Conclusions
This paper introduced a P4-based load balancing algorithm for the DC environment. The proposed LB has many benefits. It reduced the time of choosing the next hop by implementing the TCP/UDP separation technique, while the FSP technique enhanced the PPRT, resulting in faster overall traffic processing. Furthermore, the CPU usage of each switch was reduced by keeping each switch busy with few elephant flows, and the consumed bandwidth was decreased by distributing the elephant flows over all the available links. Besides, the proposed probing protocol achieved best-effort link/server failure detection. The results of this study suggest that, by combining the P4 technology with the separation and prediction techniques, a better load balancing method can be achieved in terms of bandwidth and resource utilization. The present work opens the door for future load balancing research on classifying traffic into mice and elephant flows using P4 technology.