Highly non-linear and wide-band mmWave active array OTA linearisation using neural network

This paper proposes a neural network (NN)-based over-the-air (OTA) linearisation technique for a highly non-linear and wide-band mmWave active phased array (APA) transmitter and compares it with the conventional memory polynomial model (MPM)-based technique. The proposed NN effectively learns the distinctive non-linear distortions, which may not easily fit to existing MPM solutions, and can, therefore, successfully cope with the challenges introduced by the high non-linearity and wide bandwidth. The proposed technique has been evaluated using a state-of-the-art 4 × 4 APA operating in highly non-linear regions at 28 GHz with a 100-MHz-wide 3GPP base-station signal as input. Experimental results show the pre-distortion signal generated by the NN exhibits the peak-to-average power ratio (PAPR) much lower than the one generated by MPM and consequently superior linearisation performance in terms of adjacent channel leakage ratio (ACLR) and error vector magnitude (EVM) for high non-linearity cases. Using the proposed NN-based linearisation technique, an improvement of 5-dB ACLR and 7% points in EVM are achieved, which demonstrates the promising potential of this technique for emerging

compensated, so that they all exhibit the very same behaviour. By doing so, it is possible to provide linearisation in all directions with a single DPD, in contrast to linearising the main beam only. However, compensating the mismatch requires analogue circuits, which introduces complexity and delay for large arrays and the potential changes in the PAs' behaviours due to crosstalk. In the present work, the reference signal for DPD learning was obtained through measurements from a far-field test receiver placed in the main beam direction and the focus is on the challenges related to high bandwidth and high non-linearity.
For the cases where enhanced power efficiency is required, such as Doherty PAs and envelope-tracking-based transmitters, the amplifier exhibits different behaviour at different power levels. A piecewise model based on a region partition algorithm that takes the actual non-linear characteristics of the device into account was proposed in [10], which gives significantly better linearisation than the general memory polynomial models. However, memory modelling capabilities may be compromised in piecewise models as the different sub-models operate independently, whereas memory effects may involve samples belonging to different sub-regions. A new piecewise model for PAs based on the mixture-of-experts (ME) approach, which builds on a probabilistic model that allows the different sub-models to cooperate, has been presented in [12]. It outperforms previous piecewise modelling methods. The ME approach is a promising technique and a highly valid candidate for comparison with the neural network (NN) approach in future work.
Challenges such as high bandwidth and high non-linearity lead to a rapid growth in the complexity of MPM-based algorithms. The Volterra series model, which is commonly used in MPM approaches, is preferable if the order of the non-linearity is not too high (e.g. third or possibly fifth order) [13]. With very high non-linearity order, the MPM approach is not practical because of the increased complexity and the consequent increase in the number of unknown kernel coefficients in the model [14].
Neural networks (NNs) are well known to be able to learn any arbitrary non-linear function according to the universal approximation theorem [15]. Several state-of-the-art linearisation techniques based on NN have recently been introduced. A solution for performance imperfections such as crosstalk and power amplifier (PA) non-linearities, along with modulator imperfections like in-phase and quadrature (I/Q) imbalance and DC-offset, for a wide-band direct-conversion transmitter has been recently introduced in [16,17]. A similar approach, where only the magnitude of the input signal undergoes a non-linear operation and the phase information is recovered with a linear weighting operation, has been introduced in [18].
For wideband signals, in particular, the memory effects have a significant impact [19,20]. To take care of memory effects, two dynamic neural structures have been proposed in the NN literature [21]. In the first structure, recurrent neural networks (RNNs) utilise feed-forward and feedback signal processing. In the second structure, a real-valued time-delay neural network (RVTDNN) combines I/Q processing with input time-delay lines to handle memory effects, whereas the RNN uses output-to-input time-delay lines. Reference [22] indicates that RVTDNNs offer superior performance and easy baseband implementation when used for inverse modelling of PAs with strong non-linearities and memory effects.
For the high non-linearity cases, the model needs a low learning rate during training, at the cost of training time. In the present paper, we use batch normalisation (BN) together with the hidden layer in order to use a higher learning rate and reduce the training time. Furthermore, the proposed RVTDNN uses the rectified linear unit (ReLU) activation function, which is less computationally expensive than hyperbolic tangent (Tanh) and Sigmoid because it involves simpler mathematical operations [23]. We propose an NN using only one hidden layer and a minimum number of neurons to make it comparable with conventional MPM. The proposed RVTDNN is applied to linearise highly non-linear multi-PA devices-under-test (DUTs) such as the active phased array. We use the proposed NN model for a 5G DUT that includes complex interactions between the PAs in the array, such as load modulation. Measurements quantifying this impact are included in Section 6. Finally, for the first time, to the best knowledge of the authors, a pre-distortion scheme based on the RVTDNN was validated using a real 5G test-bed environment with a minimum number of neurons and layers together with the ReLU activation function to keep the cost and size of the device during implementation as low as possible.
Figure 1 illustrates the digital pre-distortion concept for the APA based on the equivalent SISO model using the proposed neural network. The mapping relationship between the order of memory depth, the number of hidden layers and the number of neurons to the corresponding required linearity has been analysed. The optimum levels for the parameters in each block have been identified, verified through measurements, and then, the performance and the complexity are compared with the applied MPM-based DPD using the same laboratory setup.
This paper is organised as follows: Section 1 is the introduction. Section 2 describes the MPM-based approach. Section 3 is about the NN linearisation technique. Section 4 explains the NN training and parameter tuning, Section 5 is an investigation of complexity and Section 6 is about the measurement results. A discussion on the comparison between the measurement results of the MPM and NN approaches is included in Section 7, and finally, the conclusion of this work is presented in Section 8.

| MPM-BASED APPROACH
The classical approach to modelling the full behaviour of a non-linear device is the Volterra series, Equation (1), which describes the relation between the output and input signals in discrete time:

$$y[n] = \sum_{k=1}^{K} \sum_{m_1=0}^{M-1} \cdots \sum_{m_k=0}^{M-1} h_k(m_1, \ldots, m_k) \prod_{j=1}^{k} x[n - m_j] \qquad (1)$$

where K is the order of the non-linearity, M is the memory depth and $h_k(m_1, \ldots, m_k)$ are the parameters of the model, which are often referred to as the 'Volterra kernels' in the literature. The nth sample of the input signal x[n] is mixed with the M − 1 preceding samples in each kth-order Volterra kernel.
In other words, the kth kernel includes all possible combinations of k time shifts of the input signal, which includes all types of memory effects. For this reason, the Volterra series is considered the most complete model, but the computational complexity of the model is very high [24]. A much less complex model is the MPM, which is widely used for linearisation. Equation (2) represents the applied MPM, which is a variant of the Hammerstein model and has been proven effective for removing non-linearity and memory effects [25]:

$$y[n] = \sum_{k=1}^{K} \sum_{m=0}^{M} a_{km} \, x[n - m] \, |x[n - m]|^{k-1} \qquad (2)$$

where $a_{km}$ is the 2-D array of filter and power-series coefficients of the active device, K is the non-linearity order of the memory polynomial model and M is the highest memory depth. The coefficients $a_{km}$ linearly weight the non-linear signal terms and are calculated using a least-squares type algorithm. The generalised memory polynomial, which combines the memory polynomial with cross terms between the signal and lagging and/or leading exponentiated envelope terms, is presented in [25]. This model shows a slightly improved linearisation effect, but at the cost of a complexity that needs to be compared with a more complex neural network model, that is, long short-term memory (LSTM) neural network techniques [26]. In this work, we compare an MPM model based on Equation (2) with a simple neural network model to relax the overall complexity.
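To make the difference in model size concrete, the following minimal sketch evaluates a memory polynomial and counts coefficients for both models. It assumes the common memory polynomial form in which each delayed sample x[n − m] is weighted by a_km and multiplied by |x[n − m]|^(k−1), with k = 1..K and m = 0..M; the function names and the example coefficients are hypothetical and are not taken from the measurements.

```python
def mpm_output(x, coeffs):
    """Evaluate y[n] = sum_k sum_m a_km * x[n-m] * |x[n-m]|**(k-1).

    x      : list of complex baseband samples
    coeffs : dict mapping (k, m) -> coefficient a_km (assumed layout);
             samples before n = 0 are treated as zero
    """
    y = []
    for n in range(len(x)):
        acc = 0j
        for (k, m), a_km in coeffs.items():
            if n - m >= 0:
                s = x[n - m]
                acc += a_km * s * abs(s) ** (k - 1)
        y.append(acc)
    return y


def mpm_coefficient_count(K, M):
    """Number of a_km coefficients for k = 1..K and m = 0..M."""
    return K * (M + 1)


def volterra_coefficient_count(K, M):
    """Number of kernel entries h_k(m_1..m_k), k = 1..K, each m_j in 0..M-1."""
    return sum(M ** k for k in range(1, K + 1))


x = [0.1 + 0.2j, 0.5 - 0.1j, -0.3 + 0.4j]
linear = mpm_output(x, {(1, 0): 1.0})                    # identity model
compressive = mpm_output(x, {(1, 0): 1.0, (3, 0): -0.2})  # mild third-order compression
print(mpm_coefficient_count(5, 8), volterra_coefficient_count(5, 8))
```

For the same K and M, the full Volterra kernel count grows as a sum of powers of M, whereas the memory polynomial grows only linearly in both K and M.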
In [8], we have provided a detailed insight into the linearisation mechanisms for an APA based on the MPM model. A similar approach has been used for constructing the pre-distorted signals for the different non-linearity cases of the APA in the present work. The same captured input and output I and Q samples are used for both the MPM and NN techniques.

| NN model
The SISO model, where the entire transmitter has been considered as a two-port system, has been described in [8]. This model uses only one external antenna for observing the combined signal in the far field. Similarly, in the present work, the entire OTA beam-forming setup is considered as a SISO model with the APA as the main source of non-linearity. The NN is used as an inverse system for such a model and it is trained using the measured input and output data. Once the training is completed, the inverse model is used as a pre-distorter for the SISO model, as seen in Figure 2. If the output and the input of the SISO model are set to the I and Q training data, y(t), and true values, x(t), respectively, then the NN needs to be trained to capture the non-linearity of the model by generating the inverse of the non-linearity function, h(t), given in Equation (3). The aim of the training is to calculate the weights such that the NN gradually learns the non-linearity of the SISO model during the training procedure. When the cost is under the specified threshold or no longer converges, the training step is finished.
After training, the functionality of the NN, denoted as u(t) in Figure 2, is an optimal estimation of $h^{-1}(t)$. Ideally, using the pre-distorted signal, $x(t) \cdot h^{-1}(t)$, as input to the SISO model, the output will be a linearised function defined as $G \cdot x(t)$.
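The inverse-model idea can be illustrated with a toy example. The sketch below assumes a hypothetical memoryless compressive model with gain G and compression factor A (neither taken from the measured APA) for which the inverse is known in closed form; in the actual technique the inverse is learned by the NN rather than derived analytically.

```python
G = 2.0   # hypothetical linear gain of the toy SISO model
A = 0.5   # hypothetical compression strength

def siso_model(v):
    """Toy memoryless model: output = G * h(v) * v with h(v) = 1 / (1 + A*|v|)."""
    return G * v / (1.0 + A * abs(v))

def predistorter(x):
    """Closed-form inverse of the toy compression, valid while A*|x| < 1."""
    return x / (1.0 - A * abs(x))

x = [0.2 + 0.1j, -0.6 + 0.3j, 0.9j]
without_dpd = [siso_model(s) for s in x]               # compressed output
with_dpd = [siso_model(predistorter(s)) for s in x]    # ideally G * x
for s, w in zip(x, with_dpd):
    print(abs(w - G * s) < 1e-12)
```

Cascading the pre-distorter with the toy model returns G · x, which is the behaviour the trained NN is meant to approximate for the measured system.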

| NN training
The proposed NN shown in Figure 3 uses a feedforward fully connected (FC) structure. Based on the interconnection pattern or architecture, we can distinguish between feedforward networks (FNNs) and recurrent (or feedback) networks (RNNs) [27]. The feedforward network is considered since it is the most used NN and, according to the universal approximation theorem, it can approximate any non-linear function with any desired error [28]. An FC structure in a densely populated NN may increase requirements for hardware resources, but in many applications, the weight of some interconnections can be set to zero without loss of accuracy, which results in sparsely connected layers [27]. The sparse structure is out of the scope of this work. The input and output data are separated as $y_I[n - M]$, $y_Q[n - M]$, $\hat{x}_I[n]$ and $\hat{x}_Q[n]$, where n is the sample number of the I and Q data used in the training. The wide-band memory effects are modelled by the delayed replicas up to a memory depth of M. The relation between the input and output of each FC layer is expressed through the weights, $W^{(i)}$, and biases, $B^{(i)}$, as:

$$y^{(i)} = W^{(i)} x^{(i)} + B^{(i)} \qquad (4)$$

where i denotes the ith FC layer. For an input layer of L neurons and an output of P neurons, $x^{(i)}$ is an L × 1 vector, $W^{(i)}$ is a P × L matrix and $B^{(i)}$ is a P × 1 vector. Each dense layer, which is defined as (a) in Figure 3, can be described using Equation (4).
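As a minimal sketch of how the time-delayed I/Q input and one dense layer fit together, the code below builds a real-valued input vector from M consecutive complex samples and applies a weight matrix and bias vector. The ordering of the delayed I and Q entries and the helper names are assumptions made for illustration only.

```python
def tdnn_input(y, n, M):
    """Real-valued input [y_I[n], y_Q[n], ..., y_I[n-M+1], y_Q[n-M+1]] (2M entries).

    Samples before the start of the record are treated as zero (assumption).
    """
    vec = []
    for m in range(M):
        s = y[n - m] if n - m >= 0 else 0j
        vec.extend([s.real, s.imag])
    return vec


def dense(x, W, B):
    """Fully connected layer W x + B, with W given as a list of P rows of length L."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]


y = [1 + 2j, 3 + 4j, 5 + 6j]
x_in = tdnn_input(y, n=2, M=2)          # L = 2M = 4 inputs
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5, 0.5]]              # P = 3 neurons, L = 4 inputs
B = [0.0, 0.0, 1.0]
print(x_in)                             # [5.0, 6.0, 3.0, 4.0]
print(dense(x_in, W, B))                # [5.0, 6.0, 10.0]
```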
The weights and biases of each FC layer are distinctive and are optimised by back propagation. The optimisation algorithm used in this work is the adaptive moment estimator (Adam) [29]. It is based on a gradient descent algorithm that gets more computationally efficient by using momentum and randomised batches to avoid local minima. The batch size is the number of training samples used for estimating the error gradient. A batch size of, for example, 50 means that 50 of the training samples are used for estimating the error gradient before the weights are updated. Another parameter, called the training epoch, shows how many passes have been done through the training samples with a randomly selected group of batches. The training procedure is summarised in Table 1.
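The relation between batch size, epochs and the number of weight updates can be sketched as follows. The sample count and the number of epochs are arbitrary illustrative values, and the gradient step itself is left as a comment since it depends on the chosen optimiser.

```python
import random

def minibatches(num_samples, batch_size, rng):
    """Yield lists of sample indices that cover the data once in random order."""
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
num_samples, batch_size, epochs = 1000, 50, 3   # illustrative values
updates = 0
for _ in range(epochs):
    for batch in minibatches(num_samples, batch_size, rng):
        # A real trainer would estimate the error gradient on `batch`
        # and update the weights here (e.g. with Adam).
        updates += 1
print(updates)  # 60 weight updates: 20 batches per epoch, 3 epochs
```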

| Accelerating NN training
For the models used in high non-linearity cases, the training needs a low learning rate, which, on the other hand, increases the training time. Therefore, in each hidden layer, there is also an accelerator, the so-called BN layer, shown as block (b) in Figure 3. The BN layer allows using much higher learning rates that will accelerate the training and reduce the time cost significantly [30].
The BN layer normalises the mean and variance of the outputs of the dense layer to zero and one, respectively, and then introduces a new mean and variance. The output of the BN layer, $\hat{y}^{(i)}$, is expressed by:

$$\hat{y}^{(i)} = \gamma \, \frac{y^{(i)} - E[y^{(i)}]}{\sqrt{\mathrm{Var}[y^{(i)}] + \epsilon}} + \beta \qquad (5)$$

where γ and β are the learnable parameters that set the new variance (scale) and the new mean (shift), respectively, and ϵ is a constant, set to 0.001, that prevents division by zero.
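A minimal sketch of the normalisation step for a single neuron is given below: the batch is shifted to zero mean, scaled to approximately unit variance, and then rescaled by gamma and shifted by beta. The function name and the example values of gamma and beta are hypothetical; only the constant 0.001 is taken from the text.

```python
import math

def batch_norm(batch, gamma, beta, eps=0.001):
    """Normalise one neuron's outputs over a batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
new_mean = sum(out) / len(out)
new_var = sum((v - new_mean) ** 2 for v in out) / len(out)
print(new_mean, new_var)   # mean equals beta; variance is slightly below gamma**2
```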

| Activation function
For the NN to be able to fit an arbitrary non-linear function, the non-linearity is introduced in the form of an activation function, which is shown as block (c) in Figure 3. Both Tanh and ReLU are evaluated as activation functions in this work, where ReLU has been chosen as the activation function due to its performance, which is described in Section 4. ReLU introduces a non-linearity by deactivating negative inputs, adding sparsity to the model, and accelerating convergence [31]. ReLU is defined as:

$$\mathrm{ReLU}(u) = \max(0, u) \qquad (6)$$

where u is the input to the activation function. In this way, the output of a hidden layer can be expressed as Equation (7):

$$x^{(i+1)} = \mathrm{ReLU}\left(\hat{y}^{(i)}\right) \qquad (7)$$

where $\hat{y}^{(i)}$ is the output of the BN layer and $x^{(i+1)}$ is the input of the next FC layer. With the sequential structure, the inputs of the subsequent hidden layer can be described in terms of the current hidden layer. This procedure, which goes from the first layer to the last layer, is called forward propagation.

FIGURE 2 (block labels: h(t), G·h(t))
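Forward propagation through one hidden layer can be sketched as dense layer, batch normalisation with fixed statistics, ReLU, and a linear output layer. All weights and statistics below are made-up illustrative numbers, and the dictionary layout is an assumption rather than the structure used in the authors' implementation.

```python
import math

def relu(u):
    """ReLU(u) = max(0, u)."""
    return max(0.0, u)

def dense(x, W, B):
    """Fully connected layer: W is a list of rows, B a list of biases."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]

def batch_norm_inference(y, gamma, beta, mean, var, eps=0.001):
    """Batch normalisation using fixed (already learned) statistics."""
    return [g * (v - mu) / math.sqrt(s2 + eps) + b
            for v, g, b, mu, s2 in zip(y, gamma, beta, mean, var)]

def forward(x, hidden, output):
    """Input -> dense -> BN -> ReLU -> linear output layer."""
    y = dense(x, hidden["W"], hidden["B"])
    y_hat = batch_norm_inference(y, hidden["gamma"], hidden["beta"],
                                 hidden["mean"], hidden["var"])
    x_next = [relu(v) for v in y_hat]
    return dense(x_next, output["W"], output["B"])

hidden = {"W": [[1.0, -1.0], [-1.0, 1.0]], "B": [0.0, 0.0],
          "gamma": [1.0, 1.0], "beta": [0.0, 0.0],
          "mean": [0.0, 0.0], "var": [0.999, 0.999]}
output = {"W": [[1.0, 1.0], [1.0, -1.0]], "B": [0.0, 0.0]}
out = forward([2.0, 0.5], hidden, output)
print(out)   # the second hidden neuron is negative and is deactivated by ReLU
```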

| Cost function
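There are different ways, depending on the type of problem, to evaluate the difference between the real output and the estimated output, the so-called cost function. In this work, the effects of two kinds of cost functions, the Huber cost and the mean-square-error (MSE) cost, have been investigated. In the Huber cost function, instead of minimising the absolute-error cost, $|x_i - \hat{x}_i|$, the smooth $L_1$ cost function is used for regression because it is robust against gross errors [32]. The smooth $L_1$ cost function is defined as

$$L_1 = \frac{1}{B} \sum_{i=1}^{B} \varepsilon_i \qquad (8)$$

where B is the batch size and $\varepsilon_i$ is defined as a combination of the squared error and the absolute error

$$\varepsilon_i = \begin{cases} 0.5\,(x_i - \hat{x}_i)^2, & |x_i - \hat{x}_i| < 1 \\ |x_i - \hat{x}_i| - 0.5, & \text{otherwise} \end{cases} \qquad (9)$$

The corresponding MSE cost function is defined as:

$$\mathrm{MSE} = \frac{1}{B} \sum_{i=1}^{B} (x_i - \hat{x}_i)^2 \qquad (10)$$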
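The difference between the two cost functions is easiest to see on a batch that contains one gross error. The sketch below assumes the usual smooth L1 (Huber) form with a transition point of 1, which is an assumption about the exact variant; the data are made up.

```python
def mse_cost(x, x_hat):
    """Mean of squared errors over the batch."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def huber_cost(x, x_hat, delta=1.0):
    """Quadratic for small errors, linear for errors of at least delta."""
    total = 0.0
    for a, b in zip(x, x_hat):
        err = abs(a - b)
        if err < delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(x)

x = [0.0, 0.0, 0.0, 0.0]
x_hat = [0.1, -0.2, 0.0, 5.0]   # the last entry is a gross error
print(mse_cost(x, x_hat), huber_cost(x, x_hat))
```

The outlier dominates the MSE through its square, whereas it contributes only linearly to the Huber cost.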

| Training process
The concept of the proposed application is illustrated in Figure 1 and the configuration of the designed NN is shown in Figure 3. To learn the distinctive non-linear distortions, it is trained as a regression model. Considering the memory effect of the active array, the memory depth, M, has a direct impact on the number of neurons in the input layer. So, there is a trade-off between the size of M and the linearisation performance. Since the output training data and the reference input data are complex values (I and Q), the number of neurons of the input layer and output layer is set to 2M and 2, respectively.

Adjacent channel leakage ratio (ACLR) and error vector magnitude (EVM) are used as metrics for choosing the desired parameter values in each training step. The ACLR describes the power of the leakage in the adjacent channel compared to the in-band channel power and is defined as:

$$\mathrm{ACLR} = 10 \log_{10}\left(\frac{P_{\mathrm{adj}}}{P_{\text{in-band}}}\right) \qquad (11)$$

where $P_{\mathrm{adj}}$ and $P_{\text{in-band}}$ are the powers of the adjacent channel and the main channel, respectively. In this way, the signal integrity can be directly assessed in the frequency domain.
The left-side (lower adjacent channel) ACLR is used for evaluation throughout the experiments in this paper. Since ACLR only measures the distributed power in different channels, another metric for in-band signal quality, EVM, in terms of percentage, is calculated as:

$$\mathrm{EVM} = \sqrt{\frac{P(\text{in-band})_{\mathrm{error}}}{P(\text{in-band})_{\mathrm{ref}}}} \times 100\% \qquad (12)$$

where $P(\text{in-band})_{\mathrm{error}}$ and $P(\text{in-band})_{\mathrm{ref}}$ are the powers of the error vector and the ideal signal vector in the I and Q planes, respectively. All operations are realised using Python 3.8.4 in Visual Studio Code. The NN is built and trained using Keras 2.3.0-tf, and the version of TensorFlow is 2.2.0.
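Both metrics reduce to simple power ratios. The following sketch assumes the ACLR is reported in dB relative to the in-band power and that the EVM is computed from complex sample vectors; the function names and the example values are hypothetical.

```python
import math

def aclr_db(p_adj, p_inband):
    """Adjacent channel power relative to in-band power, in dB."""
    return 10.0 * math.log10(p_adj / p_inband)

def evm_percent(measured, reference):
    """Root of error-vector power over reference power, in percent."""
    p_err = sum(abs(m - r) ** 2 for m, r in zip(measured, reference))
    p_ref = sum(abs(r) ** 2 for r in reference)
    return 100.0 * math.sqrt(p_err / p_ref)

reference = [1 + 0j, -1 + 0j, 1j, -1j]
measured = [r + 0.1 for r in reference]      # constant offset as a toy error
print(aclr_db(1e-3, 1.0))                    # about -30 dB
print(evm_percent(measured, reference))      # about 10 %
```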

| Parameter tuning
For parameter tuning, 100 k I and Q samples of the input and output of the active array were captured, of which 70% were randomly chosen for training and the remaining 30% for testing.
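A random 70/30 split of paired input and output samples could be done as in the sketch below. The synthetic data, the seed and the function name are placeholders; only the sample count and the split ratio follow the text.

```python
import random

def split_train_test(samples, train_fraction=0.7, seed=0):
    """Randomly split a list of samples into training and test subsets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(samples)))
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

# Placeholder (input, output) pairs standing in for captured I/Q samples.
pairs = [(complex(i, -i), complex(2 * i, i)) for i in range(100_000)]
train, test = split_train_test(pairs)
print(len(train), len(test))  # 70000 30000
```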
The memory depth, the number of neurons in each layer and the number of hidden layers should be set appropriately. If these numbers are too small, the NN cannot capture the right features of the non-linearity model and there is a risk of underfitting. If they are too large, the result is overfitting. To avoid this, the memory depth was first set to a low number, and the other parameters were then initialised to get the best linearisation performance in terms of ACLR and EVM. For achieving faster tuning and for reducing the number of multiplications in a real application, one hidden layer is chosen. The optimisation was continued by tuning the memory depth while keeping the other parameters unchanged. Table 2 shows how the NN is configured together with the ACLR and EVM results for the different memory depths. The best performance is achieved by setting the memory depth to five, which results in the best ACLR improvement.
With the memory depth fixed to five, the other optimisation parameters, such as the activation function, cost function, batch size and number of epochs, have been tuned. The results are listed in Table 3 and, based on those results, the ReLU and the MSE are chosen for the activation function and the cost function, respectively. Further evaluation of the number of epochs shows that approximately 25 epochs are enough for the algorithm to reach its minimum cost of 1E-6 and that the cost function will not improve further, as shown in Figure 4.

| NN simulation results
Figure 5 shows the simulation results of power spectral density (PSD), amplitude-to-amplitude (AM-AM) and amplitude-to-phase (AM-PM) distortions for the active array output. The parameters used for simulation are based on the best tuning parameters from Table 3.
Several sets of pre-distorted signals have been trained based on the final model and have been used for characterising the efficiency of the model versus the level of non-linearity in the active array. These results are discussed in the next section.

| MP pre-distortion
Equation (2) models the behaviour of the PA, which means that the APA output can be estimated from the inputs. For pre-distortion, the inverse model is needed, which means that the input should be estimated based on the output; this is implemented by switching the input and outputs: where x[n] is the estimated input to the APA. This can be written as a vector-vector product in the form: where Since the absolute value, |⋅|, requires three multiplications, the complexity of finding r is thus: The complexity of the vector-vector product x[n] = rw is simply: where $C_{\mathrm{Mul,MP,complex,vector}}$ is the number of complex multiplications and $C_{\mathrm{Add,MP,complex,vector}}$ is the number of complex additions. The total number of complex multiplications becomes: A complex multiplication takes four real multiplications and two real additions, and a complex addition involves two real additions. This means that the total complexity of the MPM pre-distortion in real operations is:
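The conversion from complex to real operation counts stated above can be written as a small helper. The example counts correspond to a hypothetical complex inner product of length 45 (45 multiplications and 44 additions), chosen only for illustration; they are not the values of the paper's complexity table.

```python
def real_ops(complex_mults, complex_adds):
    """Convert complex operation counts to (real multiplications, real additions).

    One complex multiplication = 4 real multiplications + 2 real additions.
    One complex addition       = 2 real additions.
    """
    real_mults = 4 * complex_mults
    real_adds = 2 * complex_mults + 2 * complex_adds
    return real_mults, real_adds

print(real_ops(complex_mults=45, complex_adds=44))  # (180, 178)
```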

| NN pre-distortion
The complexity analysis starts from Equation (4), with L as the number of outputs of the previous layer and P as the number of inputs to the next layer. If only fully connected layers with equal numbers of neurons are considered, the problem can be further reduced as P = L.
Between each pair of fully connected layers, there are $P^2$ multiplications and $P^2$ additions. The number of operations between the input layer and the first hidden layer is 2MP multiplications and additions, where M is the memory depth. Between the last hidden layer and the output layer, there are 2P multiplications and additions. Thus, the total number of multiplications and additions is:

$$C_{\mathrm{Mul,NN}} = C_{\mathrm{Add,NN}} = 2MP + (J - 1)P^2 + 2P \qquad (22)$$

where J is the number of hidden layers. Equation (22) shows that complexity scales quadratically with the number of neurons if there is more than one hidden layer. The complexity grows linearly with the number of neurons if only a single hidden layer is used. According to the universal approximation theorem, a single hidden layer can be used for arbitrary function approximation, so for applications where low complexity is required, a single hidden layer may be desirable.
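The operation count described above (2MP between input and first hidden layer, P² between consecutive hidden layers, 2P between the last hidden layer and the output) can be checked numerically with the sketch below. The function name is hypothetical, and biases, batch normalisation and activation functions are not counted, as in the text.

```python
def nn_ops(M, P, J):
    """Multiplications (equal to additions) for 2M inputs, J hidden layers of
    P neurons each, and 2 outputs."""
    if J < 1:
        raise ValueError("at least one hidden layer is required")
    return 2 * M * P + (J - 1) * P * P + 2 * P

print(nn_ops(M=5, P=100, J=1))  # 1200: linear in P for a single hidden layer
print(nn_ops(M=5, P=100, J=2))  # 11200: the P*P term dominates
```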

| Complexity comparison
For the MPM, the pre-distorted signals based on Equation (2) with various non-linearity orders, K, and memory depths, M, have been constructed and evaluated in the lab, and the optimal values of K = 5 and M = 8 have been chosen. The NN is trained to reach the minimum MSE of approximately 1E-6, as in the example shown in Figure 4. By sweeping the NN parameters, one hidden layer with 100 neurons and a memory depth of five have been chosen. Table 4 shows the computational effort in terms of multiplications and additions based on Equations (20)-(22). Although the number of multiplications is higher for the NN compared to the MPM, the absolute number is still very low, and besides, the NN has superior linearisation performance, which is shown in Section 6.

| OTA measurement setup
The block diagram of the measurement setup is shown in Figure 6 and the actual laboratory measurement setup is illustrated in Figure 7.
The R&S SMBV100B Vector Signal Generator and its arbitrary waveform generator function generate the TX input IF signal, centred at 3 GHz, which is a 100-MHz bandwidth 5G NR signal. It is a 3GPP downlink OFDM modulated waveform with 64-QAM sub-carrier modulation, sub-carrier spacing of 60 kHz and 1584 active sub-carriers. With an oversampling factor of 6, the sample rate of the transmitter and receiver signals is 600 MHz. The peak-to-average power ratio (PAPR) of the input signal, after capturing and loading to the generator, is 11.6 dB. For up-conversion and down-conversion, an un-modulated signal of 12.5 GHz has been generated by an Agilent E3247 C, frequency-doubled to 25 GHz using a MITEQ-MAX2M200400 and fed into a power divider to be used as a local oscillator (LO) signal. Two active mixers, KTX321840 and KRX321840, operating in their highly linear region, are utilised for up-converting the IF signal to the 28-GHz carrier frequency and for down-converting the signal back to IF. A 28-GHz band-pass filter is used to select the up-converted modulated signal and suppress the LO leakage and image frequency signals. The Ducommun APH-26 063 325 is used as a pre-amplifier. The pre-amplifier is a high-power device and, while operating more than 10 dB below its compression point, the output is linear and the power is sufficient to drive the 4 × 4 APA, AAiPK428GC-A0404 [33], close to its saturated region. The APA includes four Anokiwave AWMF-0158 [34] and integrates 16 branches of attenuators, phase shifters, PAs and 16 patch antennas in a 4 × 4 active phased array. The APA is designed for a typical main beam power of +33 dBm.
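The signal parameters quoted above can be cross-checked with a few lines of arithmetic, and the PAPR of a captured waveform can be computed as the ratio of peak to mean power. The sketch below uses a made-up four-sample sequence rather than the measured 5G NR waveform, and the helper name is hypothetical.

```python
import math

def papr_db(samples):
    """Peak-to-average power ratio of a complex sample sequence, in dB."""
    powers = [abs(s) ** 2 for s in samples]
    return 10.0 * math.log10(max(powers) / (sum(powers) / len(powers)))

occupied_bw_hz = 1584 * 60e3        # active sub-carriers times sub-carrier spacing
sample_rate_hz = 6 * 100e6          # oversampling factor times channel bandwidth
print(occupied_bw_hz / 1e6, sample_rate_hz / 1e6)   # 95.04 MHz and 600.0 MHz

toy = [2.0 + 0j, 0j, 0j, 0j]        # one strong peak, otherwise silent
print(papr_db(toy))                 # about 6.02 dB (peak power 4, mean power 1)
```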
The diagonal length of the active array antenna is approximately 4 cm, which at 28 GHz results in a far-field distance of:

$$R_{\mathrm{ff}} = \frac{2D^2}{\lambda} \approx 30\ \mathrm{cm} \qquad (23)$$

where D is the diagonal length of the antenna and λ is the wavelength. The main beam signal is captured by the observation horn antenna placed 55 cm away, which is well above the far-field distance of the device. After down-conversion to IF, the signal is captured by the R&S FSW Signal and Spectrum Analyser and converted to base-band. A host PC running MATLAB and using the R&S ARB Toolbox is used for capturing and uploading the I and Q samples. The measurement setup is power calibrated in order to keep all other components in their linear operating regions, so that the only source of non-linearity is the active phased array. For controlling the main beam of the array, the code-book and software tools from Amotech […] of five. For NN, four corresponding pre-distorted signals based on the parameters in Table 3 have been constructed and used as the input to the APA for the measurements.
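The far-field criterion can be verified numerically for the stated aperture and carrier frequency. The sketch uses the common 2D²/λ rule with the free-space speed of light; the function name is hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def far_field_distance(diagonal_m, frequency_hz):
    """Far-field (Fraunhofer) distance 2*D**2/lambda in metres."""
    wavelength = SPEED_OF_LIGHT / frequency_hz
    return 2.0 * diagonal_m ** 2 / wavelength

d_ff = far_field_distance(0.04, 28e9)
print(round(d_ff, 3))          # roughly 0.3 m
print(d_ff < 0.55)             # the 55 cm observation distance exceeds it
```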

| OTA measurement results
In this section, we present the experimental results of using the NN-based DPD and compare that with the MPM-based DPD.
In both cases, we do the measurements at four different power levels where the APA is driven into compression. The radiated power has been adjusted to get the ACLR slightly worse than the limit for systems operating at FR2, which is −28 dBc [35]. The main beam power of the APA is the sum (in dB) of the transmitter power and the gain of the antenna array, and is measured as the power at the observation horn antenna placed 55 cm away, adding the propagation loss and subtracting the gain of the observation horn antenna. The pre-distorted signals are fed as input to the APA and the corresponding ACLRs and EVMs are measured for each case and each technique.
The APA is driven in four different linearity cases with main beam powers of 34, 33, 32 and 31 dBm for the very-high-, high-, medium- and low-power cases, respectively. Because of the limited output power of the 4 × 4 APA and the risk of damage to the device, we are not able to increase the main beam power further. However, with the main beam power of 34 dBm, the device is in the saturated region and the ACLR is approximately 2 dB worse than the 3GPP limit, which is suitable for our analysis. Measured OTA spectra with NN and MPM-based linearisation for the four different cases are shown in Figure 8 and the ACLR and EVM improvements are illustrated in Figure 9. Here, we can see that for higher non-linearities, the NN performs better than the MPM. In the case of low non-linearity, the MPM performance is equal or slightly better, but it is worth pointing out that linearisation is less meaningful for relatively linear and less power-efficient operation. The results demonstrate that the proposed NN is capable of effectively learning the distinctive highly non-linear distortions, which may not easily fit to existing MPM solutions.

| ACLR and total radiated power (TRP)
Even though an existing reference claims the distortion is beam-formed in the same direction as the intended signal with a multi-antenna transmitter in the single-user case [4], quantitative results are desired to evaluate if using the main beam ACLR is a valid method for characterising the linearisation performance of a beam-steerable array. For evaluating this, we performed a total radiated power (TRP) ACLR measurement and compared the results with main beam direction measurements. The TRP is defined from the integration of signal power over the angular domains. The estimated TRP for a discrete set of measured directions is defined as [35]:

$$\mathrm{TRP} \approx \frac{\pi}{2NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \mathrm{EIRP}(\phi_n, \theta_m) \sin\theta_m \qquad (25)$$

where N and M are the number of azimuth angles, $\phi_n$, and elevation angles, $\theta_m$, respectively, and $\mathrm{EIRP}(\phi_n, \theta_m)$ is the radiated power in each angular case as a sum of both linear polarisations. The TRP-ACLR in a linear scale is calculated as the total radiated power of the adjacent channel divided by the total radiated power of the in-band channel:

$$\text{TRP-ACLR} = \frac{\mathrm{TRP}_{\mathrm{adj}}}{\mathrm{TRP}_{\text{in-band}}} \qquad (26)$$

The block diagram and lab setup for the measurements are shown in Figures 10 and 11. The following procedure is applied for all specific angles θ and ϕ:
1. Place the APA at the positioner and align the coordinate system.
2. Align the beam of the APA to the desired beam steering angle.
3. Measure the main channel power and adjacent channel power using a spectrum analyser.
4. Repeat steps 1-3 for all directions in the TRP measurement grid.
5. Calculate TRP-ACLR according to Equation (26).
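The final calculation step of the procedure can be sketched as a sin-weighted average of the EIRP samples over the angular grid, followed by a ratio of adjacent-channel to in-band TRP. The weighting follows the commonly used discrete TRP estimate and is an assumption about the exact form used here; the isotropic test data and the function names are hypothetical.

```python
import math

def trp_estimate(eirp, thetas):
    """Estimate TRP from eirp[n][m] (linear units) on a grid of N azimuth
    angles and M elevation angles thetas[m] (radians), weighted by sin(theta)."""
    N, M = len(eirp), len(thetas)
    total = sum(eirp[n][m] * math.sin(thetas[m])
                for n in range(N) for m in range(M))
    return math.pi / (2 * N * M) * total

def trp_aclr_db(eirp_adj, eirp_inband, thetas):
    """Adjacent-channel TRP relative to in-band TRP, in dB."""
    ratio = trp_estimate(eirp_adj, thetas) / trp_estimate(eirp_inband, thetas)
    return 10.0 * math.log10(ratio)

# Toy isotropic radiator: 4 azimuth cuts, 18 elevation points at bin centres.
thetas = [(m + 0.5) * math.pi / 18 for m in range(18)]
eirp_inband = [[1.0] * 18 for _ in range(4)]
eirp_adj = [[1e-3] * 18 for _ in range(4)]
print(trp_estimate(eirp_inband, thetas))          # close to 1 for an isotropic source
print(trp_aclr_db(eirp_adj, eirp_inband, thetas)) # about -30 dB
```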
The steering angle of the APA is set to 0°. The position of the APA is changed by θ from 0 to 180° in steps of 10° and, for each step, the ϕ angle is changed from −90 to 90° in 20 steps. The in-band channel power and the adjacent channel power for each angular position have been measured and the TRP-ACLR level has been calculated according to Equation (26) to be −35.0 dBc. For the same setup, the main-beam-only level of ACLR is measured to be −33.3 dBc. In our work, we assume the main beam pointing is maintained in communication. Moreover, since the measured difference is less than 2 dB, the main beam ACLR is chosen as the metric for experimental validation in this work.

| ACLR and beam directions
Due to interactions between the PAs in the array, the linearised beam is sensitive to the steering angle, so the impact of beam steering needs to be quantified. A single trained DPD is not sufficient for maintaining a low distortion over a wide range of steering angles. To maintain a low level of distortion across the steering angle, a new training is required after the main beam has shifted by some degrees (depending on the actual setup) [8]. The same investigation has been done in [3] with the same conclusion. Furthermore, remarkable results in [36] show that the NN is capable of modelling the correlation between the non-linear distortion characteristics among different beams. This allows providing consistently good linearisation regardless of the beamforming direction, thus avoiding the necessity of executing continuous digital pre-distortion parameter learning.
In this work, we have quantified the load modulation impact by measuring on an over-the-air test setup in a compact antenna test range (CATR). Figure 10 shows the setup used for measuring the APA output over the air in the CATR. In this experiment, 17 different values of the steering angle θ were used in the range of −78 to +78° with a step of approximately 8°. The following procedure is applied firstly to capture the non-linear data for all angles and secondly the linearised data for all angles. For all specific angles $\theta_1$ to $\theta_{17}$, the following steps are used:
1. Adjust the steering angle to $\theta_i$ according to the code-book and software tools.
2. Adjust the mechanical angle accordingly.
3. Measure input/output data for each steering angle.
4. Make the MPM and the NN pre-distorters based on pre-distortion coefficients obtained from measurements of the 0° steering angle.
5. Use the pre-distorter as input and repeat steps 1-3.
The results are shown in Figure 12, as measured for the same power level corresponding to the highly non-linear use case depicted in Figure 8b. The ACLR of the APA without linearisation varies with the steering angle due to changes in radiation patterns and because of load modulation. Furthermore, the ACLR improvement for the linearised signals also varies with the steering angle as a result of load modulation.

| Time-domain comparison of NN and MPM pre-distortion signals
In this section, we compare the pre-distorted signals of the NN and MPM DPDs to better understand what the NN does differently. A possible explanation for the difference in linearisation performance can be found by inspecting the signals in the time domain. Figure 13 shows the complex envelope of the reference input signal, the non-linear signal, the pre-distorted signal and the response after pre-distortion for two power levels, 31 and 34 dBm, referred to as the low and high non-linearity cases, respectively. The gain is normalised to 0 dB for comparison.
The pre-distorted signal, as expected, has extra gain to counteract the decreasing non-linear gain at the points where the non-linear signal is in compression; this appears as the high peaks in the time domain. Consequently, when the pre-distorted signal is applied, the response should ideally lie on top of the reference signal. This is exactly what happens in Figure 13a,b, where there is almost no difference between the reference input signal and the measured response after pre-distortion. The case of high non-linearity is shown in Figure 13c,d, where the difference between the reference input signal and the response after pre-distortion is easily observed. Figure 13c clearly shows that the MPM technique overcompensates the compression. Compared with NNs, polynomials are inherently local approximators, in contrast to the global approximation capability of NNs when modelling strongly non-linear systems. Therefore, an NN may adapt better when extrapolating beyond the zone exploited for parameter extraction [37].

FIGURE 11 Total radiated power measurement setup using compact range chamber. APA, active phased array

FIGURE 12 Measured adjacent channel leakage ratio (ACLR) performance of memory polynomial model (MPM) and neural network (NN)-based digital pre-distortion versus steering angle using the over-the-air setup in a compact range chamber
Although this effect is still under investigation, we clearly see the impact of the pre-distorted signal's PAPR on the overall linearisation and, as a consequence, the shortcoming of the

| DISCUSSION
The difference in PAPR between the MPM and NN DPD signals was observed in all measurements where the output power of the APA in our setup was above +32 dBm.
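For reference, the PAPR of a captured complex envelope can be computed as in the minimal sketch below. It uses the record maximum as the peak, which is an assumption: the text does not state which PAPR definition (for example a CCDF percentile) was used for the measured signals.

```python
import numpy as np

def papr_db(x):
    """Peak-to-average power ratio of a complex envelope, in dB (record maximum)."""
    p = np.abs(np.asarray(x)) ** 2
    return 10.0 * np.log10(p.max() / p.mean())

constant_envelope = np.exp(1j * np.linspace(0.0, 2.0 * np.pi, 64))
single_peak = np.array([2.0, 0.0, 0.0, 0.0])
print(papr_db(constant_envelope))   # approximately 0 dB
print(papr_db(single_peak))         # 10*log10(4), approximately 6.02 dB
```

Applying the same function to the reference input and to each pre-distorted signal gives directly comparable numbers for the MPM and NN cases.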
During the experiments, we kept the PAPR of the input signal as defined in 3GPP, that is, 11.3 dB, without applying any clipping or filtering to reduce it. The root cause of the difference between the MPM and NN approaches can be explained as follows:
1. The MPM non-linearity kernel is a polynomial. For high non-linearities, high-order polynomials are necessary. In our MPM, we used a non-linearity order of five and a memory depth of eight. When trying to linearise a highly compressed, deeply saturated PA characteristic, such a high-order polynomial quickly explodes at the upper end of the input amplitude range. This drives the peaks of the pre-distorted signal to extremely high values and hence greatly increases the PAPR of the pre-distorted signal compared with that of the original modulated input signal.
2. The non-linear kernel of the NN does not inherently contain such an 'explosion' effect for high amplitudes. It is important to keep in mind that the proposed RVTDNN structure is based on supervised learning. When training the NN, we are actually using targets with low envelope fluctuations, that is, the desired I and Q at the output layer in Figure 3, which allows the NN to learn the characteristics for a signal with low envelope fluctuations. This is also related to the fact that long-term memory is built into the RVTDNN through supervised learning. This kind of long-term memory can be used to simulate the slow dynamic changes of the non-linear characteristics of the PA over time, as mentioned in [21].
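The polynomial growth described in the first point can be explored with a small memory-polynomial sketch using the stated order of five and memory depth of eight. Several things here are assumptions rather than the authors' implementation: the regressor form x(n-m)|x(n-m)|^(k-1) with all orders k = 1..5, the indirect-learning least-squares fit, and the toy memoryless saturating PA, which stands in for the measured APA.

```python
import numpy as np

def mp_basis(x, order=5, depth=8):
    """Memory-polynomial regressors x(n-m)|x(n-m)|^(k-1), k=1..order, m=0..depth-1."""
    x = np.asarray(x, dtype=complex)
    padded = np.concatenate([np.zeros(depth - 1, dtype=complex), x])
    cols = []
    for m in range(depth):
        start = depth - 1 - m
        xm = padded[start:start + len(x)]        # x(n-m), zero before the record
        for k in range(1, order + 1):
            cols.append(xm * np.abs(xm) ** (k - 1))
    return np.column_stack(cols)

def fit_postinverse(x, y, order=5, depth=8):
    """Indirect learning (assumed here): least-squares fit of x from the PA output y."""
    coeffs, *_ = np.linalg.lstsq(mp_basis(y, order, depth), x, rcond=None)
    return coeffs

def papr_db(s):
    p = np.abs(s) ** 2
    return 10.0 * np.log10(p.max() / p.mean())

# Toy, memoryless saturating PA (hypothetical; not the measured APA).
rng = np.random.default_rng(2)
x = (rng.standard_normal(4000) + 1j * rng.standard_normal(4000)) / np.sqrt(2)
y = x / np.sqrt(1.0 + np.abs(x) ** 2)
coeffs = fit_postinverse(x, y)
u = mp_basis(x) @ coeffs                         # pre-distorted drive signal
print(f"PAPR in {papr_db(x):.1f} dB, after MPM pre-distortion {papr_db(u):.1f} dB")
```

Under the assumed regressor form, the highest-order term scales with the fifth power of the envelope amplitude, so a peak 11.3 dB above the average level is weighted by 5 × 11.3 = 56.5 dB in that term, against 11.3 dB in the linear term. Comparing the two printed PAPR values shows how strongly this affects a given fit.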

| CONCLUSION
This paper presents how a neural network (NN)-based linearisation technique performs in the digital pre-distortion (DPD) of a highly non-linear active phased array (APA) driven by a wide-band 3GPP 5G mmWave base-station transmitter signal, and compares it with the conventional memory polynomial model (MPM) technique. The proposed design is implemented with a state-of-the-art 4 × 4 APA in a setup that uses up- and down-conversion from sub-6 GHz to 28 GHz, with the high non-linearity of the APA as the main impairment. The NN is built and trained in a Python simulation environment. The performance of the optimal NN pre-distorter was assessed with measurements and compared with the MPM-based DPD technique. Measurement results show that, in the case of very high non-linearity with an adjacent channel leakage ratio (ACLR) of −26 dBc, the pre-distortion signal generated by the NN exhibits a much lower peak-to-average power ratio (PAPR) than the one generated by the MPM, and consequently the NN is still capable of linearising the APA where the MPM technique used here is not. The proposed NN-based DPD technique applied to a highly non-linear APA with an ACLR of −28 dBc improves the error vector magnitude (EVM) by 7.2 percentage points and the ACLR by 4.7 dB. For the same setup, the MPM-based DPD achieves improvements of only 4.4 percentage points in EVM and 2.8 dB in ACLR.
In future work, we may investigate the robustness of NN-based linearisation with respect to the steering angle and the impact of channel properties in the high non-linearity cases.

FIGURE 1
Concept illustration of the digital pre-distortion for active phased array (APA) based on the equivalent single-input single-output model using the neural network. DNN, delay neural network

64 - JALILI ET AL.

$$y(t) = G \cdot h(t) \cdot x(t), \tag{3}$$

where G is the constant gain.
Linearisation technique based on a neural network.(a) Beam-forming behavioural single-input single-output (SISO) model; (b) neural network (NN) based on the equivalent SISO model where G is the constant gain and h(t) is the non-linear function; (c) Applying trained NN in pre-distortion and linearisation

Figure 3 .
As shown in Figure 1, the parameters of the NN are updated step by step by reducing the loss between the outputs of the NN (i.e., predicted values) and the reference inputs. The NN can gradually learn features hidden in the training data for classification or regression tasks. Generally, if the NN is trained as a classifier, the cross-entropy function is a commonly used cost function. For linearisation of active circuits where the NN needs to learn

FIGURE 4
Model loss versus the number of training/validation epochs

FIGURE 5
Simulated linearisation results with and without NN predistortion: (a) Power spectral density; (b) amplitude-to-amplitude (AM-AM) distortion; (c) amplitude-to-phase (AM-PM) distortion

FIGURE 6
Abbreviations: MPM, memory polynomial model; NN, neural network.

FIGURE 9
Comparison of neural network (NN) versus memory polynomial model (MPM): (a) adjacent channel leakage ratio (ACLR) comparison; (b) error vector magnitude (EVM) comparison

FIGURE 10
Diagram of the measurement setup for measuring active phased array (APA) in a compact antenna test range

FIGURE 13
Time-domain representation of pre-distorted signal: (a) memory polynomial model (MPM) low non-linearity case; (b) neural network (NN) low non-linearity case; (c) MPM high non-linearity case; (d) NN high non-linearity case

conventional and less complex MPM for linearising highly non-linear 5G modulated signals, whereas a less complex NN-based technique can do the job satisfactorily.
Configurations and training for optimising memory depth

TABLE 2
Abbreviations: ACLR, adjacent channel leakage ratio; EVM, error vector magnitude.
TABLE 3
NN parameter optimisation. Parameters include activation function, cost function, batch size and number of epochs
Abbreviations: ACLR, adjacent channel leakage ratio; MSE, mean-square error; NN, neural network.
a Sorted based on decreasing ACLR improvement.