Fault detection and diagnosis in refrigeration systems using machine learning algorithms

The functionality of industrial refrigeration systems is important for environment-friendly companies and organizations, since faulty systems can impact human health by lowering food quality, cause pollution


Introduction
Machine Learning (ML) is a common term for many processing methods used for data-driven tasks. The main intention of ML is to enable computers to learn, predict, or decide on unseen data without human assistance (Saravanan and Sujatha, 2018). In the 2010s, rapid development of processors, IoT, and an increasing amount of generated data paved the way for large improvements in ML capabilities. Thus, the popularity of ML increased rapidly in many industries. Machine learning is used in various contexts, such as computer vision, text classification, fault detection, language processing, image recognition, and so forth.
The idea of using ML for fault detection and diagnosis dates back to the 1980s, when the existing ML methods were not as efficient as specialized experts. However, the technologies have improved, and today the availability of powerful programming tools and self-learning algorithms allows computers to make strategic decisions and even diagnose new events (Gauglitz, 2019).
In particular, ML-based methods have been studied for fault detection and diagnosis (FDD) in different fields with promising results. For instance, ML is used for fault detection in brushless synchronous generators (Rahnama et al., 2019), in water distribution networks (Quiñones-Grueiro et al., 2021), in age intelligence systems (Liu et al., 2021), and in high-temperature superconducting DC power cables (Choi et al., 2021). In Hajji et al. (2021), several supervised ML algorithms are compared for FDD in photovoltaic systems; data from the non-faulty condition and five different faulty conditions are used for both training and testing, and the results confirm that supervised learning algorithms can be used for fault detection and ease the FDD procedure. Moreover, machine learning models are compared for sensor fault detection in Sana Ullah et al. (2021), in which five types of sensor faults are emulated, namely drift, bias, precision degradation, spike, and stuck faults.
For fault detection in office building systems, various data mining methods, in particular Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Kernelized Discriminant Analysis (KDA), semi-supervised LDA, and semi-supervised KDA, have been compared in Shioya et al. (2015). In Choudhary et al. (2021), different component faults in a rotating machine are classified using a Convolutional Neural Network (CNN) algorithm. According to Lo et al. (2019), in many industrial applications, good system models are difficult or even impossible to obtain due to the system's complexity or the large number of configurations involved in the production process. The refrigeration industry is no exception, as the system configuration varies with the demands of different owners. Hence, model-based FDD is often sensitive to model parameters in such a way that small changes in the system may lead to a poor fault detection response. In such cases, ML can be a viable approach to handling unseen situations when well trained.
In Soltani et al. (2020), a CNN model is used for evaporator fan fault detection in supermarket refrigeration systems. The same system configuration and information are used in Soltani et al. (2021) to classify the same fault and investigate the robustness of the fault detection model. However, instead of CNN, the shallow learning Support Vector Machine (SVM) and PCA-SVM classifiers are used. In Han et al. (2010), SVM and PCA-SVM are studied for the detection of eight types of faults in a simulated vapour-compression refrigeration system, in which PCA-SVM achieved a better result than SVM and a back-propagation neural network.
In the refrigeration industry, good performance of a fault detection algorithm can be defined as high classification accuracy, low computation time, and a low false positive rate. High classification accuracy ensures an accurate fault description for the technicians for quick troubleshooting, while low computation time is important because it lowers the detection time and the hardware cost. A low false positive rate increases the reliability of the fault detection model and reduces the expenses associated with service calls. Therefore, it is essential to evaluate the FDD algorithms based on these factors.
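As a rough illustration of two of these criteria, the sketch below computes classification accuracy and false positive rate from true and predicted labels. It assumes that the false positive rate is the share of non-faulty samples flagged as any fault, and that the non-faulty class carries the label -1; the helper names and the toy labels are hypothetical.

```python
NON_FAULTY = -1  # assumed label of the non-faulty class


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def false_positive_rate(y_true, y_pred, non_faulty=NON_FAULTY):
    """Fraction of non-faulty samples that are flagged as any fault class."""
    healthy_preds = [p for t, p in zip(y_true, y_pred) if t == non_faulty]
    if not healthy_preds:
        return 0.0
    flagged = sum(1 for p in healthy_preds if p != non_faulty)
    return flagged / len(healthy_preds)


# Toy example: four non-faulty samples and two samples each of faults 6 and 8.
y_true = [-1, -1, -1, -1, 6, 6, 8, 8]
y_pred = [-1, -1, 8, -1, 6, 6, 8, 6]

acc = accuracy(y_true, y_pred)
fpr = false_positive_rate(y_true, y_pred)
print(f"accuracy = {acc:.2f}, false positive rate = {fpr:.2f}")
```

Computation time, the third criterion, would be measured separately, for example by timing the prediction call on the target hardware.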
Because of the increasing digitalization of refrigeration systems (RS), many companies aim to improve existing FDD performance by utilising various data. As mentioned above, data-driven FDD algorithms perform satisfactorily in many other applications; thus, such algorithms are selected and evaluated in this work. That is, we evaluate and optimize various FDD algorithms for the purpose of selecting the best classifier for use in RS industry applications.
The main contributions of this study are summarized below:
• The best approach from an industrial perspective is proposed to detect a faulty system and localize the fault.
In this study, all sensor faults and some component faults are simulated using a high fidelity RS model. The model is already in use at Bitzer Electronics to develop and verify control algorithms. Notice that we restrict our attention to steady state operating conditions, which are commonly encountered in industrial applications such as reefer containers, cold storage houses and so on. It is acknowledged that transient operation is important in many applications as well, e.g., in supermarket refrigeration systems. However, transient behavior presents its own set of unique challenges and is considered out of the scope of this work.
The faults include positive and negative offsets in sensors as well as specific component faults; the faults are detailed in section 2. Three classifiers, namely CNN, SVM, and LDA, are compared to diagnose every selected fault. For pre-processing of the input data, LDA and PCA are compared.
The results indicate that the SVM classifier is the superior method, being able to diagnose all classes with 100% classification accuracy except the non-faulty and malfunctioning expansion valve conditions, which are diagnosed with 98% and 96% classification accuracy, respectively. The LDA and LDA-SVM classifiers are capable of detecting the faulty condition with 100% classification accuracy. However, these models perform poorly regarding robustness, as a significant drop in classification accuracy is observed. Finally, CNN and PCA-SVM show a general lack of performance.
The remainder of this paper is structured as follows. First, the refrigeration system background and specification, as well as data acquisition and its specification, are introduced in section 2. Then, in section 3, the mathematical approaches of the classifiers mentioned above are explained. Afterwards, the specification of each model and the result of the classification are presented in section 4. Finally, the work is concluded in section 5.

Background
In general, RS are used to cool down the goods inside an insulated room, which is called a cold room, by transferring the heat to the environment. Fig. 1 illustrates an RS in which the refrigerant runs through the pipes. In each refrigeration cycle, heat is absorbed and dissipated. The compressor receives low pressure, low temperature refrigerant gas and releases high pressure, high temperature gas to the inlet of the condenser. The condenser is responsible for dissipating the refrigerant heat to the ambient environment, and finally gives out liquid refrigerant at high pressure while the temperature decreases. Afterwards, an expansion valve decreases the pressure of the refrigerant. Low pressure, low temperature refrigerant enters the evaporator pipes in order to absorb the heat from the cold room environment. Thus, the refrigerant changes phase from liquid to gas before reaching the compressor.
Defective components or sensors in RS lead to high power consumption, air pollution, wear and tear of the components, and/or food waste. RS have the best efficiency when everything is nominal. Thus, when faults occur, the system might deviate from the peak efficiency point. With some faults, the system runs outside its permitted envelope, while other faults lead to wear and tear of the components due to high temperature, too little lubrication, or too high pressure on the components. Late fault detection may cause the temperature of the refrigerated goods to exceed the permitted limits. Therefore, early fault detection in RS helps to maintain the required quality of refrigerated goods such as food products or medicine, and to prevent excessive maintenance and spoilage costs.
The high fidelity model used by Bitzer Electronics is presented in Fig. 2. In this model, a two-stage semi-hermetic reciprocating compressor is simulated with operating speed in the range 25-87 Hz.
Here, compressor cooling capacity ($V_{cpr}$) is defined as compressor operating speed in percentage. Therefore, compressor speed under 25 Hz and full speed operation of 87 Hz are defined as 0% and 100% compressor cooling capacity, respectively. The refrigerant type is R134a, and an electrical expansion valve is simulated. The maximum cooling capacity of the cold room is 17 kW at 10 °C ambient temperature ($T_{amb}$) and 5 °C cold room temperature ($T_{room}$). The controller adjusts the opening degree of the expansion valve ($v_{exp}$) using superheat temperature ($T_{sh}$) measurements as input. $T_{sh}$ is the difference between the refrigerant evaporation temperature ($T_0$) and the suction gas temperature ($T_{suc}$). In addition, $V_{cpr}$, the evaporator fan speed ($V_{evap}$), and the condenser fan speed ($V_{cond}$) are controlled using the controller inputs shown in Fig. 2. In this paper, the supply temperature ($T_{sup}$) is the same as the cold room temperature ($T_{room}$) and is used as the set point in the simulation model. Thus, the set point is the temperature of the air after transferring heat to the refrigerant.
In Fig. 2, the main components of the model are presented with grey blocks. The red blocks indicate some of the fault inputs, which are added to the corresponding parameters. Twenty types of faults are simulated, including positive and negative offsets in sensors as well as a number of component faults; the faults are described in Table 1. When collecting a data set, the model is first run with no effect of the red blocks, thus producing non-faulty data. After logging sufficient non-faulty samples, one fault is applied to the model and data collection continues. Some of the simulated faults, such as the pressure sensor offsets and the $T_{dis}$ sensor offset, are not visible in Fig. 2, since they are simulated inside the relevant blocks.
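The capacity and superheat definitions given above can be sketched in a few lines. The linear mapping between 25 Hz and 87 Hz is an assumption (the text fixes only the two end points), as is the sign convention of superheat as suction gas temperature minus evaporation temperature; the function names are illustrative and not taken from the simulation model.

```python
F_MIN_HZ = 25.0  # speeds at or below this value count as 0 % capacity
F_MAX_HZ = 87.0  # full speed, defined as 100 % capacity


def compressor_capacity_percent(speed_hz):
    """Map compressor speed in Hz to cooling capacity in percent.

    Assumption: linear interpolation between the two end points.
    """
    if speed_hz <= F_MIN_HZ:
        return 0.0
    if speed_hz >= F_MAX_HZ:
        return 100.0
    return 100.0 * (speed_hz - F_MIN_HZ) / (F_MAX_HZ - F_MIN_HZ)


def superheat(t_suc, t_0):
    """Superheat as suction gas temperature minus evaporation temperature."""
    return t_suc - t_0


print(compressor_capacity_percent(56.0))  # mid-range speed
print(superheat(t_suc=-2.0, t_0=-10.0))   # temperatures in degrees Celsius
```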

Data acquisition
Machine learning models learn based on input information. Thus, the quality of training data is an essential factor. The training data should contain sufficient information for the algorithm to generalize and make a correct decision when receiving a new observation. Using simulated data for the training phase can, in fact, improve the verification result for three reasons: first, it allows data collection in different operating conditions; second, data of specific faults can be correctly labeled; and finally, we ensure that the training data is not taken from an already faulty system with an unwanted or unknown fault.
To prevent overfitting the model, the input data needs to be taken from various operating conditions in an acceptable range and under the same operating conditions for each fault. That is, the model has to be able to deal with operational variations. Generally, in RS, operations vary based on several factors, such as the required temperature set point, compressor cooling capacity or heat load, compressor type, ambient temperature, etc. In this work, various data sets from different operating conditions are taken as training data. As shown in Fig. 3, the set point is changed in the range 0 to 15 °C, and the heat load in the cooling room varies in the range 3 to 20 kW to obtain variation in compressor speed. Another data set is taken in which, besides set point and heat load, $T_{amb}$ is varied; therefore, the data is referred to as having a large operating condition range. In this data set, $T_{amb}$ is varied in the range 10 to 30 °C to investigate how the classification accuracy differs if the training data includes more variations. Then, the verification data set is collected using operating conditions different from the training conditions to investigate how the model performs classification in an unseen operating condition, see the blue block in Fig. 3.
Each fault in the system is considered a class. As introduced in Table 1, twenty faults are taken into account in this work, all of which are observed in real systems. Therefore, twenty-one classes are studied, including the non-faulty condition. In particular, the expansion valve faults are modeled as wrong valve positions compared to the command signal. In fault 8, the actual valve position is 120% of the command signal, while in fault 18, the valve opens to 80% of the command signal. Changes in condensing temperature ($T_C$) are compensated by condenser fan work, because there is a feedback control on the condenser fan to keep constant pressure relative to $T_{amb}$, and the controller controls $V_{cpr}$ based on $T_{sup}$. Thus, it is hard to observe any visual changes in the data characteristics during steady state response. However, in some other cases, the fault affects the controller response immediately, and the changes can be observed in the data easily. For example, fault 6, which is shown in Fig. 4, clearly gives rise to variations in $T_{dis}$ and $V_{cpr}$. The compressor works based on the controller command. In the case of fault 6, $\rho$ and/or $P_{suc}$, which are fed into the controller, are measurements of the faulty sensor. Therefore, the compressor behavior is based on the faulty sensor measurement. However, as the real $P_{suc}$ is less than required, the mass flow rate drops. In Fig. 2, the $P_{suc}$ offset is applied only to the sensor reading. The controller controls both the expansion valve opening degree and the compressor speed to reach a desired pressure, and when the reading is positively offset, the controller must lower the actual suction pressure to reach the desired reading.
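The two kinds of fault described above, an additive sensor offset and a valve that follows a scaled version of its command, can be sketched as follows. This is a minimal illustration and not the Bitzer simulation model: the function names are hypothetical, the 0 to 100% clipping of the valve position is an assumption, and the numerical values are examples only.

```python
def apply_sensor_offset(true_value, offset):
    """Sensor fault: the reading equals the true value plus a constant offset."""
    return true_value + offset


def apply_valve_fault(command_percent, gain):
    """Valve fault: the actual opening is `gain` times the commanded opening.

    Assumption: the physical opening is limited to the range 0..100 %.
    """
    return max(0.0, min(100.0, gain * command_percent))


command = 50.0                              # commanded opening degree in percent
loose = apply_valve_fault(command, 1.2)     # fault 8: 120 % of the command
blocked = apply_valve_fault(command, 0.8)   # fault 18: 80 % of the command
p_suc_reading = apply_sensor_offset(2.0, 0.2)  # positive offset on a 2.0 bar value

print(loose, blocked, p_suc_reading)
```

Because the controller acts on the offset reading rather than the true value, a sensor fault of this kind changes the closed-loop behavior, which is why fault 6 is visible in the logged signals.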

Data specification and dimensionality reduction
The idea behind dimensionality reduction techniques is to remove dependent and redundant features from the original data by projecting the data to a lower-dimensional space, which holds only essential information. These approaches deal with noisy data and reduce the computation load for classification purposes (Soltani et al., 2021). In this work, the input data has 14 feature vectors or dimensions, including sensor signals and some of the variables from the RS controller, including superheat temperature, saturated evaporation temperature, compressor cooling capacity/speed, condenser fan speed, and vapour density. Statistical approaches such as PCA and LDA are used to reduce the input data dimensions before passing them through the classifiers. In this paper, all transient parts of the data are removed, both for training and validation data. The 14-dimensional data is reduced to 2-dimensional data using PCA as the input to the SVM. LDA is also used for dimensionality reduction and transfers the data into a 6-dimensional data set before sending the data into the SVM classifier. Moreover, CNN and SVM are also applied to the 14-dimensional data set. For SVM and LDA classification, each class of data contains 1200 samples, and for the CNN classifier, 18000 samples with a sample rate of 1 Hz. Note that LDA and SVM are shallow learning methods which, as an advantage, do not require as many samples as CNN. Too many samples result in too high a computation load and low classification accuracy. As described in 2.1, the training data of each class contains various RS operating conditions. This variety prevents overfitting and increases the model's capability to classify unseen operating conditions.

Methods
SVM, LDA, and CNN are all supervised learning methods (Saravanan and Sujatha, 2018). Supervised ML classifiers categorize a new data set using a pre-trained model. Thus, the model is first trained using input data and defined labels.
CNN is a deep learning classifier commonly used for image processing purposes. A CNN comprises two phases: feature extraction and classification. The input data consists of feature vectors $\chi_\varphi \in \mathbb{R}^{n \times 1}$, $\varphi = 1, \dots, c$, which are gathered in data matrices $X_\kappa \in \mathbb{R}^{n \times c}$, one for each class $\kappa$, $\kappa = 1, \dots, \nu$. The numbers $n = n_\kappa$ and $c$ quantify the number of samples in each class and the number of features, respectively. For convenience, it is assumed that all the data matrices have the same dimensions, although this is not a strict requirement.
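In the feature extraction phase, the CNN makes use of so-called neurons, which take the data matrices $X_\kappa$ as input and return (neuron) outputs $y^k_\kappa$, $k = 1, \dots, S$, where $S$ is the number of neurons (see Fig. 5). Each neuron has a weight matrix $W^k \in \mathbb{R}^{\tilde{n} \times \tilde{c}}$ and a bias matrix $b^k \in \mathbb{R}^{\tilde{n} \times \tilde{c}}$ associated with it. For each $\kappa$, $X_\kappa$ is partitioned into sub-matrices $(x_\kappa)_{ij} \in \mathbb{R}^{\tilde{n} \times \tilde{c}}$. Then the neuron output $y^k_\kappa$ is a matrix whose entries are defined as:

$$\left(y^k_\kappa\right)_{ij} = f\left(\mathbf{1}^\top \left(W^k \odot (x_\kappa)_{ij} + b^k\right) \mathbf{1}\right),$$

where $\odot$ denotes element-wise multiplication of matrices, $\mathbf{1}$ denotes a vector of ones, and $f : \mathbb{R} \to \mathbb{R}$ is an activation function.
It is noted that the size $\tilde{n} \times \tilde{c}$ and the number $S$ of the $W^k$'s are hyperparameters, which can be tuned during the design of the CNN model to optimally filter different information of the input.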
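A single neuron of the feature extraction phase can be sketched in NumPy as follows. The sketch assumes that the data matrix is split into non-overlapping sub-matrices of the same size as the weight matrix and that the activation is a ReLU; both choices, like the function name, are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np


def neuron_output(X, W, b, f=lambda z: max(z, 0.0)):
    """Apply one neuron to every non-overlapping sub-matrix of X.

    Each output entry is f(sum of all entries of W * block + b), i.e. the
    element-wise product summed to a scalar and passed through f.
    """
    n_t, c_t = W.shape
    rows, cols = X.shape[0] // n_t, X.shape[1] // c_t
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = X[i * n_t:(i + 1) * n_t, j * c_t:(j + 1) * c_t]
            out[i, j] = f(np.sum(W * block + b))
    return out


X = np.arange(16, dtype=float).reshape(4, 4)  # toy 4 x 4 data matrix
W = np.ones((2, 2))                           # toy weights: plain block sum
b = np.zeros((2, 2))
Y = neuron_output(X, W, b)
print(Y)
```

With unit weights and zero bias the neuron simply sums each 2 x 2 block, which makes the role of the weight size as a hyperparameter easy to see: a larger block gives a smaller output matrix.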
As illustrated in Fig. 6, the output of the feature extraction phase contains the essential information of the input data. This output is then flattened before being used as input to the classification phase, which is a fully connected Multi-layer Perceptron (MLP), see Geidarov (2017) and Bishop (2006), with $N_{MLP}$ fully connected layers. The output vector of each MLP layer $Y^l \in \mathbb{R}^{n_l \times 1}$ is computed recursively as

$$Y^l = f\left(W^l Y^{l-1} + b^l\right), \quad l = 1, \dots, N_{MLP},$$

where $W^l \in \mathbb{R}^{n_l \times n_{l-1}}$ is a layer weight matrix, $b^l \in \mathbb{R}^{n_l \times 1}$ is a bias vector, $f : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is the $l$'th layer's neuron activation function, and $n_{N_{MLP}} = \nu$.
The output $\hat{y} \in \mathbb{R}^{\nu}$ of the CNN is generated by the so-called Softmax activation function, where the $\kappa$th coordinate of $\hat{y}$ is given by:

$$\hat{y}_\kappa = \frac{\exp\left(Y^{N_{MLP}}_\kappa\right)}{\sum_{j=1}^{\nu} \exp\left(Y^{N_{MLP}}_j\right)},$$

with $Y^{N_{MLP}}_\kappa$ being the $\kappa$th coordinate of $Y^{N_{MLP}}$. Here, it is noted that since the CNN output is normalized ($\sum_{\kappa=1}^{\nu} \hat{y}_\kappa = 1$), $\hat{y}_\kappa$ may be considered as the probability of a new input $X$ belonging to class $\kappa$.
During the training process, the estimates of the classes are compared with the true labels $y_\kappa$ using a loss function. The loss function is also a hyperparameter that needs to be determined for the model; a common loss function is cross entropy:

$$L(y, \hat{y}) = -\sum_{\kappa=1}^{\nu} y_\kappa \log \hat{y}_\kappa.$$

The training process aims at adjusting the weights in such a way that better prediction of the correct class is achieved. In other words, the minimum loss is obtained.
Minimization of the loss function can be done using different optimization techniques; the most common being backpropagation (Bishop, 2006), which is used together with gradient descent. Once the weights have been adjusted to yield the optimal output for a validation data set, this model can be used to classify unlabeled, new data.
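The classification phase, the Softmax output, and the cross-entropy loss can be sketched together in NumPy. The layer sizes, the tanh activation, and the random weights below are arbitrary assumptions for illustration; subtracting the maximum inside the Softmax is a standard numerical-stability step that is not part of the description above.

```python
import numpy as np


def mlp_forward(y0, weights, biases, f=np.tanh):
    """Recursively apply y <- f(W y + b) over all fully connected layers."""
    y = y0
    for W, b in zip(weights, biases):
        y = f(W @ y + b)
    return y


def softmax(z):
    """Normalize a score vector into values that sum to one."""
    e = np.exp(z - np.max(z))  # shift by the maximum for numerical stability
    return e / e.sum()


def cross_entropy(y_true, y_hat):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -float(np.sum(y_true * np.log(y_hat)))


rng = np.random.default_rng(0)
y0 = rng.normal(size=8)                       # flattened feature vector (toy size)
weights = [rng.normal(size=(5, 8)), rng.normal(size=(3, 5))]
biases = [np.zeros(5), np.zeros(3)]

y_last = mlp_forward(y0, weights, biases)     # output of the last MLP layer
y_hat = softmax(y_last)                       # class probabilities, 3 classes
y_true = np.array([0.0, 1.0, 0.0])            # one-hot true label
loss = cross_entropy(y_true, y_hat)
print(y_hat, loss)
```

With a one-hot label, the loss reduces to the negative log-probability assigned to the true class, so it approaches zero as that probability approaches one.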

LDA classifier
Linear discriminant analysis (LDA) can be used both for dimensionality reduction and classification purposes. In LDA, as depicted in Fig. 7, linear separation of classes is done after projecting the data onto another space. LDA seeks a large separation between transformed classes compared to the original one after the dimension of the transformed data is reduced. A transformation matrix is obtained by use of the between-classes variance and the variance within each class (Bishop, 2006).
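The variance between classes $S_B \in \mathbb{R}^{c \times c}$ is calculated as follows:

$$S_B = \sum_{\kappa=1}^{\nu} n_\kappa \left(\mu_\kappa - \mu\right)^\top \left(\mu_\kappa - \mu\right),$$

where $\mu_\kappa \in \mathbb{R}^{1 \times c}$ is the mean value of class $\kappa$, and $\mu \in \mathbb{R}^{1 \times c}$ is the mean of all $\mu_\kappa$. Afterwards, the within-class variance $S_s \in \mathbb{R}^{c \times c}$ is calculated by

$$S_s = \sum_{\kappa=1}^{\nu} \sum_{j=1}^{n_\kappa} \left((X_\kappa)_j - \mu_\kappa\right)^\top \left((X_\kappa)_j - \mu_\kappa\right),$$

where $(X_\kappa)_j$ is the $j$th row (or sample) in $X_\kappa$. $S_s$ and $S_B$ are used to find the transformation matrix $\Omega \in \mathbb{R}^{c \times c}$ defined as

$$\Omega = S_s^{-1} S_B.$$

Afterwards, this transformation matrix is used to generate data in another space in which the classes are linearly separable. In order to reduce the dimensions of the data in the new space, eigenvectors and eigenvalues of $\Omega$ are obtained. The eigenvectors with higher eigenvalues carry more information of the data distribution (Tharwat et al., 2017).
Order the eigenvalues of $\Omega$ in decreasing order, select the first $\alpha \le c$ corresponding eigenvectors, and organize them as columns in a new matrix $V \in \mathbb{R}^{c \times \alpha}$. The lower-dimensional samples $r_j \in \mathbb{R}^{1 \times \alpha}$, $j = 1, \dots, n$, in class $\kappa$ are then the rows of the matrix product $X_\kappa V$.

Fig. 5. A feature extraction layer of CNN: a sub-matrix $(x_\kappa)_{ij}$ is convolved with each weight matrix $W^k$, resulting in a number of matrices as the output of the layer.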
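The LDA projection steps described in this section can be sketched in NumPy. The sketch assumes that the between-class scatter is weighted by the number of samples in each class and that the within-class scatter is invertible; the two Gaussian toy classes and the function name are made up for illustration.

```python
import numpy as np


def lda_projection_matrix(class_data, alpha):
    """Return V (c x alpha) built from the leading eigenvectors of S_s^{-1} S_B.

    class_data: list of arrays, one (n_k x c) matrix of samples per class.
    """
    mus = [X.mean(axis=0) for X in class_data]   # class means
    mu = np.mean(mus, axis=0)                    # mean of the class means
    c = class_data[0].shape[1]
    S_B = np.zeros((c, c))
    S_s = np.zeros((c, c))
    for X, mu_k in zip(class_data, mus):
        d = (mu_k - mu)[None, :]                 # 1 x c row vector
        S_B += X.shape[0] * d.T @ d              # between-class scatter
        Xc = X - mu_k
        S_s += Xc.T @ Xc                         # within-class scatter
    Omega = np.linalg.solve(S_s, S_B)            # equals S_s^{-1} S_B
    eigvals, eigvecs = np.linalg.eig(Omega)
    order = np.argsort(eigvals.real)[::-1][:alpha]  # largest eigenvalues first
    return eigvecs[:, order].real


rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0, 0.0], size=(100, 3))  # toy class 1
X2 = rng.normal(loc=[5.0, 0.0, 0.0], size=(100, 3))  # toy class 2
V = lda_projection_matrix([X1, X2], alpha=1)
r1, r2 = X1 @ V, X2 @ V                               # projected samples
print(V.ravel(), r1.mean(), r2.mean())
```

In the toy example the two classes differ only along the first feature, so the leading eigenvector points mainly along that axis and the projected class means stay well separated in one dimension.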

SVM classifier
Support vector machine (SVM) is a supervised machine learning method and linear classifier which classifies data into two or more classes. In the sequel, we focus on the case of two classes.
Consider the two classes $X_\kappa$, $\kappa = 1, 2$, containing the samples as rows, and set $y_j = -1$ or $y_j = 1$ if $x_j \in \mathbb{R}^{1 \times c}$ is a row in $X_1$ or a row in $X_2$, respectively. Assume that the two classes are linearly separable, that is, the samples of each class can be separated by a (linear) hyperplane. Then there exists a hyperplane

$$H = \left\{ x \in \mathbb{R}^{1 \times c} : x w^\top + b = 0 \right\}$$

with weight $w \in \mathbb{R}^{1 \times c}$ and bias $b \in \mathbb{R}$ such that $1/\|w\|$ is the distance from $H$ to the nearest sample in class 1 and class 2. These nearest samples are usually called support vectors (see Fig. 8). Moreover, $w$ and $b$ may be found as the solution to the optimisation problem

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_j \left(x_j w^\top + b\right) \ge 1 \ \text{for all samples } j. \tag{8}$$

The optimal (or hard) margin (that is, $1/\|w\|$ with $w$ the solution to (8)) may not always lead to the best result when feeding unseen data to the model. The optimal margin might result in overfitting or margin violations. In particular, outliers can fall into the wrong class and be misclassified (Murty and Raghava, 2016). In practice, the classifier is allowed to make small misclassifications during the training, which is called soft margin (shown in Fig. 8). To do so, slack variables $\zeta_j$ are added to the optimization problem:

$$\min_{w, b, \zeta} \ \frac{1}{2} \|w\|^2 + C \sum_{j} \zeta_j \quad \text{subject to} \quad y_j \left(x_j w^\top + b\right) \ge 1 - \zeta_j, \quad \zeta_j \ge 0 \ \text{for all samples } j,$$

where $C$ is a hyperparameter that determines the size of the allowed misclassification. The size of the parameter $C$ is tuned by software such that the classification accuracy of unseen data is high.
In many classification problems, a linear classification is not possible. The kernel trick is a method for dealing with this case. It yields a transformation of the input space, that is, the space to which the samples belong, into another higher dimensional space, in which the samples are linearly separable (Murty and Raghava, 2016). This new space is typically called the feature space. The kernel trick relies on the use of a kernel function. In this work, we consider a special class of kernel functions, called the Gaussian Radial Basis Functions (GRBF), given by

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right).$$

The hyperparameter $\gamma > 0$ determines the influence of each sample on selecting the hyperplane during training. It should be noted that choosing $\gamma$ too big results in overfitting and choosing $\gamma$ too small leads to under-fitting of the model (Bishop, 2006).
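To make the role of $\gamma$ concrete, the following sketch evaluates the Gaussian RBF kernel for one pair of samples under a few values of $\gamma$. The sample values are arbitrary and the function name is illustrative.

```python
import numpy as np


def grbf_kernel(x_i, x_j, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x_i - x_j||^2) for two samples."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


x_a = [0.0, 0.0]
x_b = [3.0, 4.0]          # squared distance to x_a is 25

k_same = grbf_kernel(x_a, x_a, gamma=0.01)    # identical samples
k_small = grbf_kernel(x_a, x_b, gamma=0.01)   # small gamma: broad influence
k_large = grbf_kernel(x_a, x_b, gamma=1.0)    # large gamma: very local influence
print(k_same, k_small, k_large)
```

A large $\gamma$ makes the kernel value drop quickly with distance, so each training sample influences only its immediate neighbourhood, which is consistent with the overfitting tendency noted above.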

Multi-class classification
In the case of more than two classes, the problem can be solved using two approaches. The first one is to consider each class against the rest of the classes and is called One Versus the Rest (OVR). For model training using OVR, one binary classifier is used for each class against all the other classes as the second category. Therefore, for a data set including $\nu$ classes, $\nu$ binary classifiers are created. For unseen data classification, each classifier is tested to determine to which class the new sample belongs. However, in many cases, the result of OVR is inconsistent, as the sample can belong to either more than one class or none of them, illustrated as the gray stars in Fig. 9. Since OVR picks one class against all other classes together, the number of samples in the corresponding class is typically much smaller than in the rest of the classes. Therefore, the big difference between the numbers of samples often impacts the decision boundary (Bishop, 2006).
The second multi-class classification approach takes each class versus another and is called the One Versus One (OVO) approach. Thus, for each pair of classes, one classifier is trained. Finally, $\nu(\nu - 1)/2$ classifiers determine the class boundaries, as shown in Fig. 9.
The OVO approach is not as computationally efficient as OVR because it uses more classifiers. Moreover, the OVO approach has a tendency to overfit (Platt et al., 2000). However, in the end, a certain amount of trial and error is unavoidable in selecting a multi-class SVM classifier, as it depends on the input data and feature space.
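The difference in the number of binary classifiers can be checked with a short enumeration. The sketch below only counts the binary sub-problems for the twenty-one classes considered in this work; the integer labels 0 to 20 are placeholders and do not correspond to the labels in Table 1.

```python
from itertools import combinations


def ovr_tasks(labels):
    """One binary problem per class: that class against all the others."""
    return [(k, "rest") for k in labels]


def ovo_tasks(labels):
    """One binary problem per unordered pair of classes."""
    return list(combinations(labels, 2))


labels = list(range(21))          # placeholder labels for 21 classes
n_ovr = len(ovr_tasks(labels))    # nu classifiers
n_ovo = len(ovo_tasks(labels))    # nu * (nu - 1) / 2 classifiers
print(n_ovr, n_ovo)
```

Each OVO classifier is trained on the samples of two classes only, so the individual problems are smaller and better balanced than in OVR, even though there are ten times as many of them here.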

Experiments
In this work, PCA and LDA are built in Python for dimensionality reduction purposes. It is advantageous to use lower-dimensional input data if it reduces the computation time of the classification and/or increases accuracy by removing redundant information in the data set, such as noise. This work tests and compares PCA-SVM and LDA-SVM models to the SVM classifier with full-dimensional data. The algorithms are built using the scikit-learn library in Python, which provides many efficient algorithms for ML, dimensionality reduction and classification. In Aurélien (2019), the ways of implementing the aforementioned ML techniques in the scikit-learn library are described. In this work, the label -1 is assigned to non-faulty data, while other labels are specified in Table 1. Moreover, the classifiers are fed with two sets of training data, which are described in section 2, in order to assess the suitability of the training data.

Full-dimensional classifiers
The input data used for the SVM model includes $n = 1200$ samples of 14 feature vectors for each class. In addition, the input data contains samples from different system configurations. Each sample is labelled with one of the labels in Table 1. The SVM classifier performs OVO classification using $C = 1000$ and $\gamma = 0.01$ (see section 3.2); the hyperparameters were found by trial and error. The result of the classification is represented in Fig. 10. True labels are the labels assigned to each class during the training phase, while Predicted labels refers to the prediction of the classifier during the training process. Thus, the diagonal values represent correct classifications. In this test, 250 samples with 1 Hz sample rate are selected for prediction.
The SVM result shows high classification accuracy for most of the classes, and there are no false positives. The only fault that is partly misclassified is the broken compressor (label 17 in Table 1), at 93% accuracy.
As mentioned in section 3, LDA can be used both for dimensionality reduction and classification purposes. Here, LDA is used to classify all 21 classes of data while reducing the dimensions of the input data from 14 to 5. As shown in Fig. 10, the response of the LDA classifier is very similar to the SVM classification, exhibiting 100% classification accuracy for most of the classes and no false positives. The only misclassification, of about 3%, concerns the broken compressor, which is mistaken for either a $P_{suc}$ sensor negative offset or a broken evaporator fan.
CNN is a deep learning model and needs more samples than LDA or SVM. In the CNN model experiment, the data set for each class contains 12000 samples of all 14 feature vectors. The classification response of the training is represented in Fig. 11. The CNN classifier obtained a total accuracy of 94% and could classify most of the faults with 100% accuracy. The noticeable drawback is the false positive rate of 58%. The non-faulty condition was misclassified as classes with labels 8 and 18, which are both expansion valve malfunctions.
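A minimal scikit-learn sketch of an SVM configured with the hyperparameters quoted above is given below. The data are synthetic stand-ins (four well-separated Gaussian clusters in 14 dimensions with a few example labels), not the refrigeration data, so the resulting accuracy says nothing about the results in Fig. 10. Note that scikit-learn's `SVC` trains one-versus-one classifiers internally for multi-class problems, and `decision_function_shape="ovo"` only selects the shape of the returned decision values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 4 classes, 200 samples each, 14 features.
rng = np.random.default_rng(1)
labels = np.array([-1, 6, 8, 18])            # -1 = non-faulty, others = example faults
centers = 10.0 * np.eye(len(labels), 14)     # well-separated cluster centers
X = np.vstack([c + rng.normal(size=(200, 14)) for c in centers])
y = np.repeat(labels, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RBF-kernel SVM with the hyperparameters quoted in the text.
clf = SVC(kernel="rbf", C=1000, gamma=0.01, decision_function_shape="ovo")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=labels)  # rows: true, columns: predicted
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"accuracy = {accuracy:.3f}")
```

The diagonal of the confusion matrix holds the correctly classified samples, and the row for label -1 shows directly how many non-faulty samples were flagged as a fault.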

Reduced-dimension classifiers
In this part, PCA and LDA are used to reduce the input dimensionality. These approaches are investigated to see whether PCA or LDA can improve classification results. In addition, it is studied whether low dimensional inputs can reduce training computation time in the case of PCA and LDA.
After feeding data into PCA and transforming to the new space, it appears that the first two dimensions of the transformed data contain more than 80% of the variations in the new space, as seen in Fig. 12. Therefore, the first two principal components are used as the inputs to the SVM instead of the 14-dimensional data. Fig. 13 shows the response of the PCA-SVM classifier with $C = 1000$, $\gamma = 0.01$, and OVO decision function.
The result of PCA-SVM shows misclassification of most of the classes. PCA causes classes to overlap, as the most uncorrelated information is squeezed into the first two principal components. The result of PCA-SVM classification is not satisfactory for the multi-class classification, even though it gives satisfactory results for binary classification in Soltani et al. (2021).
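The component selection described above can be sketched with scikit-learn's `PCA` by accumulating the explained variance ratio until a threshold is reached. The 80% threshold follows the text; the synthetic 14-dimensional data with two dominant latent directions are an assumption made only to have something to decompose.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: two strong latent directions mixed into 14 features plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2)) * [10.0, 5.0]
mixing = rng.normal(size=(2, 14))
X = latent @ mixing + 0.5 * rng.normal(size=(1000, 14))

pca = PCA().fit(X)                                   # keep all 14 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 80 %.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
X_reduced = PCA(n_components=2).fit_transform(X)     # two components, as in the text

print(cumulative[:3], n_components, X_reduced.shape)
```

A high explained-variance share does not guarantee class separability, since PCA ignores the labels; this is one plausible reading of why the two-component PCA-SVM performs poorly on the multi-class problem.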
LDA is already used for classification, as shown in Fig. 13. However, it can also be used only for dimensionality reduction; then, the transformed lower dimensional data is used in a classification algorithm such as SVM. Retaining the five eigenvectors corresponding to the largest eigenvalues, LDA reduces the input dimensions from eleven to five. An LDA-SVM classifier is built using $C = 1000$, $\gamma = 0.01$, and OVO decision function for the SVM part. The LDA-SVM classifier performs satisfactorily for many of the classes shown in Fig. 13. However, the
As seen in Table 2, SVM and LDA achieved the best results, with high accuracy and no false positives. However, the prediction time is relatively low for the LDA classifier compared to SVM, PCA-SVM, and LDA-SVM. On the other hand, the CNN classifier has the lowest prediction time, but the false positive rate is unacceptable. Therefore, LDA is found to be the best model for multi-fault classification. Afterwards, more investigation is done on SVM, LDA and LDA-SVM, which perform better during the training phase.
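The LDA-SVM combination described above, with LDA as a dimensionality reduction step followed by an RBF-kernel SVM, can be sketched as a scikit-learn pipeline. The data below are synthetic (six Gaussian clusters in 14 dimensions), chosen so that five discriminant directions exist; they are not the refrigeration data, and the printed accuracy is not comparable with Table 2.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data: 6 classes, 200 samples each, 14 features.
rng = np.random.default_rng(2)
n_classes, n_per_class, n_features = 6, 200, 14
centers = 10.0 * np.eye(n_classes, n_features)
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LDA reduces the input to 5 dimensions; the SVM classifies the reduced data.
lda_svm = make_pipeline(
    LinearDiscriminantAnalysis(n_components=5),
    SVC(kernel="rbf", C=1000, gamma=0.01, decision_function_shape="ovo"),
)
lda_svm.fit(X_train, y_train)

reduced = lda_svm.named_steps["lineardiscriminantanalysis"].transform(X_test)
accuracy = lda_svm.score(X_test, y_test)
print(reduced.shape, f"accuracy = {accuracy:.3f}")
```

Wrapping both steps in one pipeline ensures that the LDA projection is fitted on the training data only and then reused unchanged at prediction time.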

The classifiers verification
In this part, the validation data is specified with a set point, heat load and ambient temperature that differ from those used for the training set. In this data set, $T_{set}$ is 4 °C, the heat load is 13 kW and the ambient temperature is 17 °C. Fig. 14 shows the response of the SVM, LDA, and LDA-SVM classifiers trained with the first training data set, with variations in set point and heat load. The overview of the results in Table 3 indicates that even though the classifiers did a good job during training and test, they cannot deal with new data taken from a system in a new operating condition. Therefore, the classification results are not satisfactory, especially when looking at the false positive rate.

Effect of data variation
To deal with the challenge of misclassification of unseen data, a new training data set is fed into the same model, which contains more excitation by varying the RS operation: ambient temperature from 10 to 30 °C, set point from 0 to 12 °C, and heat load from 3 to 18 kW. In addition, to obtain better results, all 14 feature vectors are tested to see whether any of them contributes to misclassification. As a result, three features of the input data that were previously used, namely $P_{suc}$, compressor power consumption and density, are removed from the training and validation data sets, as they adversely affect the classification accuracy. The results are depicted in Fig. 15.
The overview of the results in Table 4 shows that the SVM classifier obtains more accurate results after training with more excited training data and removing the three mentioned feature vectors. However, for the LDA-SVM and LDA classifiers, the most accurate results are obtained when just the power consumption of the compressor and density are removed. Using this adjustment, the false positive percentage is improved considerably, and SVM stands alone regarding the diagnosis of all faults simultaneously with high accuracy. It is seen that SVM has the highest classification accuracy of 95% with a 4% false positive rate. The only class which SVM does not diagnose is the blocked expansion valve, which is misclassified as the loose expansion valve. Therefore, even though this fault is misclassified, we can still trust that the malfunctioning valve needs to be checked by the technicians.

Conclusion
From an industrial point of view, it is very beneficial to have one classifier that can diagnose twenty-one classes. Moreover, the classifiers considered in this work can be trained off-line. Off-line training has two potential advantages. First, it is possible to train the classifier with simulation data and use the trained classifier for classification of real data, which avoids training the classifier with real data that are wrongly labeled. Second, the deployed classifier is computationally lighter than it would be if the training process also had to be executed on the embedded software. This is an advantage when the capacity of the processor of typical refrigeration systems is considered. The SVM model obtained the best classification accuracy among the algorithms tested. If a lower false positive percentage is required, LDA can be used with a 0% false positive rate, but only for distinguishing the non-faulty class from the faulty classes. Therefore, the system could benefit from having two classifiers to make the diagnosis result more reliable. Before the classifier is implemented on real refrigeration systems, the trained classifier will be verified with real data from the field in future work.
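One possible way to combine two classifiers is a two-stage scheme, sketched below: a detector with a low false positive rate decides whether the sample is faulty, and a multi-class diagnoser is consulted only when a fault is detected. The text does not specify how the two classifiers would be combined, so this arrangement is an assumption, and the two stand-in functions are hypothetical placeholders for trained LDA and SVM models.

```python
NON_FAULTY = 0  # assumed label for the non-faulty class


def two_stage_diagnosis(sample, detector, diagnoser):
    """Run the detector first; ask the diagnoser only if a fault is detected."""
    if not detector(sample):
        return NON_FAULTY
    return diagnoser(sample)


# Hypothetical stand-ins for trained models, used only to make the sketch runnable.
def toy_detector(sample):
    """Flag a fault when the residual magnitude exceeds a made-up threshold."""
    return abs(sample["residual"]) > 0.1


def toy_diagnoser(sample):
    """Map the sign of the residual to one of two made-up fault classes."""
    return 8 if sample["residual"] > 0 else 9


print(two_stage_diagnosis({"residual": 0.02}, toy_detector, toy_diagnoser))
print(two_stage_diagnosis({"residual": 0.30}, toy_detector, toy_diagnoser))
print(two_stage_diagnosis({"residual": -0.30}, toy_detector, toy_diagnoser))
```

With this arrangement, the false alarm behaviour of the whole scheme is bounded by that of the detector, while the diagnoser determines which fault class is reported.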

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Zahra Soltani reports financial support was provided by Innovation
Fig. 2. The grey blocks indicate the main components of the RS. The red blocks are the faults or offsets that can be applied to each variable.

Fig. 4 represents four examples of data sets taken from the same model and under the same conditions. These examples represent a non-faulty condition, a suction pressure sensor fault with a 0.2 bar positive offset (fault 6), a loose expansion valve fault in which the valve reacts 20% more than the value commanded by the controller (fault 8), and a blocked expansion valve that reacts 20% less than the commanded value. During data acquisition, the model is run in the non-faulty condition until sample 6000. Then, each fault is introduced from sample 6001 to 12000, as seen in Fig. 4. It is observed that in some cases, such as fault 8, the data look very similar to some of the other faulty or non-faulty data.
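The labelling and fault injection described above can be sketched as follows. The sample counts, the 0.2 bar offset, and the 20% valve deviations come from the text; the constant signal values, the function names, and the way the fault is applied directly to a recorded signal are assumptions for illustration, since in this work the faults are applied inside the simulation model.

```python
N_HEALTHY = 6000   # samples 1..6000 are non-faulty
N_TOTAL = 12000    # samples 6001..12000 contain the fault
NON_FAULTY = 0     # assumed label for the non-faulty class


def make_labels(fault_id):
    """Per-sample class labels for one data set with a single injected fault."""
    return [NON_FAULTY] * N_HEALTHY + [fault_id] * (N_TOTAL - N_HEALTHY)


def add_sensor_offset(signal, offset, start=N_HEALTHY):
    """Add a constant offset to a sensor signal from index `start` onwards."""
    return [v + offset if i >= start else v for i, v in enumerate(signal)]


def scale_valve_command(command, gain, start=N_HEALTHY):
    """Scale the valve opening relative to the commanded value from `start` onwards."""
    return [v * gain if i >= start else v for i, v in enumerate(command)]


# Hypothetical constant signals, used only to make the effect visible.
p_suc = [2.0] * N_TOTAL          # suction pressure in bar
valve_cmd = [0.5] * N_TOTAL      # commanded valve opening

p_suc_fault6 = add_sensor_offset(p_suc, 0.2)     # 0.2 bar positive offset
valve_loose = scale_valve_command(valve_cmd, 1.2)    # reacts 20% more
valve_blocked = scale_valve_command(valve_cmd, 0.8)  # reacts 20% less

labels_fault6 = make_labels(6)
print(labels_fault6[N_HEALTHY - 1], labels_fault6[N_HEALTHY])
print(p_suc_fault6[N_HEALTHY - 1], p_suc_fault6[N_HEALTHY])
```

Index 6000 in the zero-based lists corresponds to sample 6001 in the text, so the first half of every data set keeps the non-faulty label.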

Fig. 3. An overview of data collection and ML setup. The red section indicates the training phase, where data are collected and used for training the ML model. The blue section shows verification data specification and classification.

Fig. 4. Four examples of data sets from different classes with the same system configuration. The set point of T sup = T room is set to 7 °C, and the heat load in the cooling room is 13 kW at an ambient temperature of 25 °C.

Fig. 9. Multi-class data classification using OVR (left) and OVO (right).

Fig. 12. The first two principal components contain the most variation among all 14 principal components.

Fig. 14. Classification responses of the three classifiers for validation data with a different system operating condition.

Fund Denmark and Bitzer electronics A/S.

Fig. 15. Higher classification accuracy after training with the new training data for all three classifiers, compared to Fig. 14.
• A deep learning classifier and several shallow learning classifiers are proposed for detecting and diagnosing twenty types of faults in RS.
• The importance of training data qualification regarding data variation and feature selection is illustrated.
• All of the proposed classifiers are compared regarding classification accuracy, computation time, and false positive rate.

Table 1
Fault types and descriptions.

Table 2
Comparison of different classifiers.

Table 3
Robustness of classifiers against different operating conditions.

Table 4
Robustness of classifiers after using qualified training data.