Thermal performance evaluation of a high-density data centre cooling system under fault conditions

In this study, an actual 20 MW data centre project was analysed to evaluate the thermal performance of an IT server room during a cooling system outage under six fault conditions. In addition, a method of organizing and systematically managing the verification of operational stability and energy efficiency was identified for data centre construction in accordance with the commissioning process. It is essential to understand the operational characteristics of data centres and to design optimal cooling systems to ensure the reliability of high-density data centres. In particular, it is necessary to consider these physical results and to perform an integrated review of the time required for emergency cooling equipment to operate as well as the availability time of the backup system.


Introduction
Over the past decade, data centres have made considerable efforts to ensure energy efficiency and reliability, and the size and stability of their facilities have grown in response to the enormous increase in demand [1,2]. If a data centre experiences a system outage or fault condition, it becomes difficult to provide stable, continuous IT service, and on a large scale such an event can disrupt the finance industry, the stock market, telecommunications, and the Internet. It has therefore become critical to design and implement backup systems so that system stability is maintained even in emergencies. Considering environmental concerns as well, ensuring stable and efficient performance is an important aspect of data centre design and implementation. Conventional construction methods focus on tangible results, such as the construction period and the financial cost of setting up the data centre infrastructure according to the design plans, which can make it difficult to verify whether non-IT equipment performs properly or whether the design operates as intended. Furthermore, when equipment with performance problems is installed, the problems are not immediately apparent and can be difficult to detect. Even when they are detected, it is difficult to determine clearly whether the main contractor or a vendor is responsible, so a warranty claim cannot be made. Data centres strengthen their stability with redundant power supply paths, including emergency generators and UPSs, but IT servers require uninterruptible supplies of not only power but also cooling [3][4][5]. For this purpose, central cooling systems are designed to allow chilled water supply during cooling system outages by including storage buffer tanks for stable cooling of IT equipment.
Consequently, the mission-critical facility design required for the stable operation of data centres leads to large cost increases, so careful reviews must be performed from the initial planning stage [6][7][8].
Considering that such emergency situations occur very few times during the life cycle of a data centre, and that the tolerance of IT servers to various operational thermal environments has improved, there is considerable room for reducing the operating times and capacities of chilled-water storage tanks. In a case study on a recently completed 20 MW data centre, we assumed an emergency situation in the cooling system and evaluated the subsequent temperature increase of the chilled water and the indoor temperature changes in the IT server room. We analysed the thermal environment of the data centre by employing a computational fluid dynamics (CFD) model and quantitatively examined the degree to which the environment inside the IT server room was maintained by secondary chilled water circulation when the central cooling plant was interrupted. The goal was to increase the reliability of the system by using the analysis results to set an initial cooling response limit time and determine an appropriate chilled-water storage tank size.

Major issues in data centres
The results of benchmarking 63 data centres that experienced unplanned outages showed that the damage costs incurred in 2016 were 38% higher than in 2010, and the mean damage cost of an outage at a single data centre increased from $505,502 in 2010 to $740,357 in 2016 [9]. Fig. 1 summarizes the root causes and ratios of system outages in the sample of 63 data centres. The major cause was UPS system failure, which appears to be due to inadequate UPS reliability verification at initial installation. To resolve this problem, it is necessary to verify equipment and systems thoroughly and to conduct trial runs according to a systematic process from the initial design phase. In addition, cooling system failures accounted for a large portion of the fault conditions. When an outage occurs, the damage cost is lower if the outage time is minimized by a quick response. To create stable data centres, therefore, backup systems that can handle unexpected accidents must be systematically designed, tested, and managed from the initial stage. Initially, data centres were designed and operated with a focus on IT equipment stability rather than energy savings. In the past, the temperature and humidity of data centres were managed very strictly, at 21.5°C and 45.5% respectively, to maintain optimal thermal conditions for IT equipment operation. For years, IT support infrastructure design had to match the specifications of high-density power components, including the proper cooling and operating condition ranges of the equipment, and a considerable amount of energy was inevitably required to maintain a constant temperature and humidity level. However, recent technological advances in the IT field have improved the heat resistance of such equipment, and as concerns about energy costs and greenhouse gas emissions have become more prominent, the environmental specifications of equipment for air cooling have become somewhat more flexible.
As shown in Table 1, an IT server intake temperature of 18-27°C is recommended, with allowable temperature ranges of 15-32°C (class A1) and 10-35°C (class A2) and an allowable relative humidity range of 20-80% [10]. The Green Grid report [11] compiled failure rates for a sample of devices from several vendors, normalized to 1.0 at 20°C. The failure rate during continuous operation at 35°C is 1.6 times that during continuous operation at 20°C. For example, if 1000 servers are continuously operated in a hypothetical data centre and an average of ten servers normally experience errors when operating at 20°C, an average of sixteen servers can be expected to experience errors when operating at 35°C.
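The expected-failure arithmetic above can be sketched directly as a minimal calculation (the relative rates are the normalized values from the Green Grid data cited; the helper function name is ours, introduced for illustration):

```python
# Expected failure count scaled by the temperature-dependent relative
# failure rate (normalized to 1.0 at 20 degC, per the Green Grid data [11]).
def expected_failures(baseline_failures, relative_rate):
    return baseline_failures * relative_rate

RATE_20C = 1.0  # reference rate at 20 degC
RATE_35C = 1.6  # reported rate for continuous operation at 35 degC

print(expected_failures(10, RATE_20C))  # 10.0 failures expected at 20 degC
print(expected_failures(10, RATE_35C))  # 16.0 failures expected at 35 degC
```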

Data centre cooling system
As shown in Fig. 2, the cooling system of the case study data centre is composed of central chilled-water-type computer room air handling (CRAH) units fed by a district cooling system that supplies chilled water to a nine-storey IT server room. In total, 43 CRAH units are installed on each floor, and the IT server racks are arranged within cold aisle containment to allow relatively high supply air (SA) temperatures from the CRAH units. On the primary side of the heat exchanger, the district chilled water undergoes thermal exchange so that relatively high-temperature chilled water is supplied to the secondary side. Furthermore, an emergency chilled water supply is held in thermal storage tanks for use if an outage occurs. Table 2 shows the supply conditions of the central cooling plant. In a dedicated data centre cooling system, the horizontal piping is normally installed in a loop configuration on each floor, with a redundant riser to prepare for emergencies. Chilled-water storage buffer tanks were installed to provide stable chilled water before emergency power becomes available in the event of a cooling system outage. It is important to identify an economical, optimal storage tank size by determining how long the chilled water in the pipes can be recirculated without the central cooling system and without affecting the IT server operating environment. For this purpose, the amount of water in the pipes was calculated first. As shown in Table 3, the total amount of water in the chilled water pipes was calculated to be 220 m³.
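The water content in Table 3 is simply the sum of the internal volumes of the pipe segments. A minimal sketch of that tally follows; the segment diameters and lengths below are illustrative placeholders, not the project's actual piping schedule:

```python
import math

def pipe_volume_m3(inner_diameter_m, length_m):
    """Internal water volume of one cylindrical pipe segment, in m^3."""
    return math.pi * (inner_diameter_m / 2) ** 2 * length_m

# Illustrative segments (inner diameter m, length m) -- NOT the actual schedule.
segments = [(0.40, 300.0), (0.25, 500.0), (0.15, 900.0)]
total = sum(pipe_volume_m3(d, l) for d, l in segments)
print(f"total water content: {total:.1f} m3")
```

Summing such segments over the supply and return risers and the loop piping on each floor yields the 220 m³ reported in Table 3.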

Thermal modelling for a data centre
Using the data centre thermal model, the SA temperature of the CRAH units and the inlet air temperature of the IT servers were analysed. CFD simulation was performed to find the allowable chilled water supply temperature range by checking the SA temperature range of the CRAH units and calculating the time required to reach the allowable chilled water temperature limit. In this study, 6 Sigma DC software was used, as it is specially designed to reflect the characteristics of data centres. One of its analysis modules, the 6 Sigma DC Room, is a CFD simulation module that can be used to perform integrated efficiency reviews of IT server rooms. The special feature of this module is that it contains extensive information on, and databases for, the most important types of IT equipment employed in data centres, allowing accurate IT environments to be created [12]. The software uses a Reynolds-averaged Navier-Stokes solver with the standard k-ε turbulence model. It operates partly as a 'black box' package, i.e. the user has limited control over certain aspects of the solution procedure. It uses the finite volume method, integrating the governing equations over the control volumes obtained by discretizing the solution domain. The calculated variables are located at the centroids of the grid cells.
The discretized results are then obtained from a set of algebraic equations, each relating the value of a variable in a cell to the values in neighbouring cells [13]. The computational domain corresponds to a grid of 64×100×64 elements contained in a 10×40×10 m³ volume. The local Nusselt number distribution was evaluated along the normalized axial position y⁺ = 2y/(D_h·Re·Pr) [14]. The basic module was composed of a minimum of seven cold and hot aisles (area Aʹ in Fig. 3). As shown in Fig. 3, the IT equipment was composed of racks that can each hold a maximum of 42U of servers, and the power density was set at 4.0 kW/rack. In total, 192 IT server racks were arranged according to the standards presented in ASHRAE TC 9.9.
Table 3. Water content in chilled water pipes.
Table 4 shows the simulation boundary conditions, based on the operating conditions of the IT servers. The chilled water temperature was varied over 10-18°C, corresponding to CRAH unit SA temperatures of 20-25°C. The environmental standards within which the IT servers could operate normally were set at 18-27°C for the IT server inlet temperature and 40-60% relative humidity as the judgment criteria, as these are the recommended values for Classes A1-A2 (Table 1).

Simulation results
The cooling capacity of a CRAH unit is determined by the minimum inlet chilled water temperature. The variation of the air inlet and outlet (SA/RA) temperatures with the chilled water inlet temperature was analysed based on the technical data of the CRAH unit used in this study. According to the results, allowable operating conditions are achievable up to a chilled water supply temperature of 14°C; at 15°C or higher, the cooling capacity of the CRAH unit is reduced and the temperature of the air supplied to the IT server room increases, as shown in Fig. 4. In the CFD simulation results, the IT server inlet air temperature and the server room air temperature distribution as functions of the CRAH unit SA temperature and the chilled water supply temperature are shown in Tables 5, 6 and 7.
Fig. 4. SA temperature and cooling capacity of a CRAH unit according to chilled water supply temperature.
The most important element in evaluating the air distribution efficiency of an IT server room is the air temperature distribution within the room, particularly the inlet air temperature of the IT servers. Since an increase in the inlet air temperature is a primary factor in server failure, the air distribution efficiency was increased by using cold aisle containment. Although the temperature ranges differ according to the SA temperature, the containment system maintains a clear difference between the cold and hot aisle air temperatures because the aisles are physically separated. Temperature increases due to the recirculation of outlet air into the IT server inlets are a major cause of server failures, which mainly occur at the top of the server rack.
If the server room cooling load is less than 85% of the cooling capacity of the CRAH unit, acceptable operating conditions are achievable until the SA temperature reaches approximately 24.0°C, according to the design cooling capacity; at this point the chilled water temperature is 17.0°C. As shown in Fig. 5, if the chilled water temperature exceeds 17.0°C, the indoor temperature increases rapidly as the cooling capacity of the CRAH unit drops below 80% (Fig. 4).
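The simulated relationship between chilled water supply temperature (10-18°C) and CRAH SA temperature (20-25°C) can be approximated with a simple interpolation. Linearity between the two simulated endpoints is our assumption for illustration; the paper reports discrete simulated points, not a formula:

```python
def sa_temperature(chw_supply_c):
    """Approximate CRAH supply-air temperature (degC) for a given chilled
    water supply temperature, linearly interpolated between the simulated
    endpoints (CHW 10 degC -> SA 20 degC, CHW 18 degC -> SA 25 degC).
    Linearity is an assumption, not a result stated in the text."""
    return 20.0 + (chw_supply_c - 10.0) * (25.0 - 20.0) / (18.0 - 10.0)

print(sa_temperature(17.0))  # ~24.4 degC, near the 24 degC allowable SA limit
```

Under this approximation, a chilled water supply temperature of 17°C yields an SA temperature close to the 24°C limit found in the CFD results, consistent with the allowable range reported above.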

Temperature control approaches
After the allowable chilled water temperature range was found by CFD analysis, the next step was to calculate the time required to reach this temperature. The increase in the chilled water temperature after a cooling system outage was analysed by scaling the system down to a typical floor. On this basis, the total amount of water inside the chilled water pipes was around 45 m³ per floor. Table 8 shows the cooling capacity and operating conditions of the CRAH units on the typical floor. Eqs. (1)-(3) are the basic functions for calculating the rate of change of the chilled water temperature in the pipes and the delay time after a cooling system outage.
As shown in Eq. (4), if the heat gain of the IT server room is constant, a constant amount of cooling should be supplied by the cooling coil of the CRAH unit. If the temperature difference ΔT between the chilled water inlet and outlet is assumed constant, the chilled water temperature change and the delay time can be expressed as Eq. (5) and Eq. (6), respectively. With these functions, the time at which the cooling system outage occurred was set to T0, and the increase in the chilled water temperature due to cooling loss over time was calculated. At this point, the amount of chilled water at 10°C in the supply pipe, excluding the chilled water at 15.5°C in the return pipe downstream of the CRAH units, is 22.5 m³, or 50% of the total; this amount is denoted V1. In addition, t1 is the time at which the thermal capacity of V1 is fully exhausted and the temperature of the entire volume V2 reaches 15.5°C, calculated to be 3.45 min by applying Eq. (6). The cooling system is selected so that the CRAH unit can maintain the set temperature up to a chilled water supply temperature of 14°C for the waterside economizer, and up to this point operation remains in the safe range. Up to a chilled water temperature of 14°C, the maintenance of the IT environment is unaffected by the changes in the air inlet and outlet temperatures caused by changes in the chilled water inlet and outlet temperatures, based on the technical data provided by the equipment manufacturer. As found in the CFD simulation, 24°C is the maximum allowable CRAH supply air temperature at which the ASHRAE allowable IT server inlet temperature requirements are met, and the corresponding maximum chilled water supply temperature is 17°C.
The time required to reach this chilled water temperature is around 5 min 20 s after a cooling system outage, during which allowable operation is possible. Fig. 6 shows the safe and allowable operation periods according to the chilled water temperature.
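The delay-time estimates above can be reproduced from the sensible-heat balance underlying Eqs. (5)-(6). The per-floor cooling load of roughly 2.5 MW used below is an assumption inferred from the reported times, not a figure stated directly in the text:

```python
RHO = 1000.0  # water density, kg/m^3
CP = 4.186    # specific heat of water, kJ/(kg K)

def warm_up_time_s(volume_m3, delta_t_k, load_kw):
    """Time (s) for a water volume to warm by delta_t_k under a constant
    heat input load_kw -- a sensible-heat balance in the spirit of
    Eqs. (5)-(6)."""
    return volume_m3 * RHO * CP * delta_t_k / load_kw

LOAD_KW = 2500.0  # assumed per-floor cooling load (inferred, not stated)

# t1: supply-side water V1 (22.5 m^3) warming from 10 to 15.5 degC
t1 = warm_up_time_s(22.5, 15.5 - 10.0, LOAD_KW)
# t2: entire floor loop (45 m^3) warming from 15.5 to the 17 degC limit
t2 = warm_up_time_s(45.0, 17.0 - 15.5, LOAD_KW)

print(f"t1 = {t1 / 60:.2f} min")   # ~3.45 min, as reported in the text
print(f"t1 + t2 = {t1 + t2:.0f} s")  # ~320 s, i.e. about 5 min 20 s
```

With the assumed load, the first stage reproduces the reported 3.45 min, and adding the warm-up of the full loop to 17°C gives about 320 s, matching the 5 min 20 s allowable operation period.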

Conclusions
The temperature conditions and allowable operation times of the chilled water and air sides of a CRAH unit after an unplanned cooling system outage were analysed for the stable and economical design of, and equipment selection for, data centre cooling systems. CFD analysis was performed at each stage to analyse the temperature of the air supplied to the servers to remove the heat gain of the IT servers, as the supply air temperature and cooling capacity of the CRAH unit changed, and the results were compared with the ASHRAE standard.
(1) Up to a chilled water supply temperature of 17°C and a CRAH unit supply air temperature of 24°C, the temperature of the air flowing into the IT servers fell within the required range of 18-27°C. At 85% of the CRAH unit coil capacity, allowable operation was possible for approximately 320 s after a cooling system outage. (2) Starting at a CRAH unit chilled water supply temperature of 18°C and a supply air temperature of 25°C, the coil capacity became smaller than the cooling load and a rapid temperature increase occurred, which is a serious cause of IT equipment failure. (3) During a cooling system outage, there is a high possibility of a rapid temperature increase inside the IT server room. Thus, backup systems must be activated within 300 s.
It is essential to understand the operational characteristics of data centres and design optimal cooling systems to ensure the reliability of high-density data centres. In particular, it is necessary to consider these physical results and to perform integrated reviews of the time required for emergency cooling equipment to operate and the availability time of the chilled water storage tanks. In addition, integrated safety evaluations must be performed and the effects of each design element must be determined in future research.