Image super-resolution reconstruction based on residual compensation combined attention network

Abstract: In image reconstruction, the residual network ignores part of the residual information when extracting features. We propose an image super-resolution reconstruction method based on a residual compensation combined attention network (RCCN). First, we construct a three-way residual network to compensate for the feature information lost by the standard residual network; second, we design a combined attention module in which 3D (pixel-level) attention complements the image attention information while channel attention learns the channel weight information; finally, experiments show that our method produces clearer results than other advanced methods, and the objective evaluation indexes are all greatly improved.


Introduction
Single-image super-resolution is a classical image recovery problem [1] that aims to recover a high-resolution (HR) image from a degraded low-resolution (LR) image. Current single-image super-resolution reconstruction techniques can be divided into three categories: interpolation-based methods [2], reconstruction-based methods [3] and learning-based methods [4]. In recent years, deep learning methods have developed rapidly and have shown great potential in the field of computer vision. Dong et al. [5] first applied deep learning to the image super-resolution problem and proposed super-resolution reconstruction by convolutional neural networks (SRCNN), which achieved end-to-end mapping between LR and HR images; however, SRCNN introduced additional computation through pre-upsampling, and its three convolutional layers limited the amount of information it could extract. To address this problem, Dong et al. [6] proposed super-resolution reconstruction based on fast convolutional neural networks (FSRCNN), which uses a deconvolution layer instead of bicubic interpolation for upsampling and increases the depth of the network from three to eight layers. Later, many super-resolution reconstruction algorithms based on deep neural networks were proposed. Lim et al. [7] proposed super-resolution reconstruction based on an enhanced deep residual network (EDSR), which removes the BN layers while using residual blocks, accelerating the convergence of the network.
Although all of the above deep learning-based SR methods obtain good reconstruction results, some problems remain. They all ignore the fact that much feature information is lost by single-path forward propagation; although residual networks can alleviate this problem, the reconstructed details are still insufficient. To address this, we propose an image super-resolution reconstruction method with residual compensation and combined attention. The contributions of this paper are summarized as follows: (1) To enhance the extraction of long-range features by residual networks, we introduce a parallel multi-path extraction module. This module greatly enhances the detail extraction capability of the network and strengthens the generalization capability of the network model.
(2) A combined attention module is designed using a channel attention module and a pixel attention module. This module re-encodes feature information in the channel and pixel dimensions, enabling the network to adaptively select valid information and suppress interfering information.
(3) Experimental validation is performed on standard test datasets. The experiments show that our method has good reconstruction performance and generalization capability.

Residual networks
To make deep neural network training easier, He et al. [8] proposed ResNet, which uses skip connections between adjacent feature layers to ensure that feature information is not lost during forward propagation. This solves the vanishing-gradient problem during training and allows networks to be made much deeper. The operation can be expressed as x_{l+1} = x_l + F(x_l), where x_l represents the input features of the l-th layer, F(·) is the residual mapping learned by that layer, and x_{l+1} represents the output features.
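The identity mapping above can be sketched in a few lines. This is a minimal NumPy illustration of the skip connection only, not the paper's network; the `transform` callable stands in for the convolutional layers of a real residual block.

```python
import numpy as np

def residual_block(x, transform):
    """Skip connection x_out = x + F(x): the block only has to learn the
    residual F, and gradients flow through the addition unchanged."""
    return x + transform(x)

# Toy example: the lambda stands in for the conv layers of a real block.
x = np.ones((4, 4))
y = residual_block(x, lambda t: 0.1 * t)  # y = x + 0.1 * x
```

Because the identity path bypasses `transform`, even a transform that outputs near-zero still lets the input pass through intact, which is what keeps gradients from vanishing in deep stacks.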

Attention mechanisms
The attention mechanism is an efficient feature selection mechanism. By generating attention weights that focus on the salient regions of a feature map and ignore redundant features, the accuracy of feature extraction can be improved while adding only a few parameters. Hu et al. [9] proposed channel attention networks, which improve the effective use of computational resources by adjusting the channel attention of network features so that the network focuses on useful features. Zhang et al. [10] proposed RCAN, which introduces channel attention into residual blocks so that the network generates attention weights based on the image information of each channel. Zhao et al. [11] proposed pixel attention (PA), which produces a three-dimensional attention map to filter features, introduces fewer additional parameters, and improves reconstruction performance.

The proposed RCCN method
In this section, we first introduce each module of the proposed method and then describe the overall framework.

Residual compensation combined attention network
To increase the ability of the residual network to extract feature information, this paper proposes an image super-resolution reconstruction algorithm based on a residual compensation combined attention network. The overall framework of the network is shown in Figure 1. Our proposed RCCN consists of three modules: a shallow feature extraction module, a nonlinear mapping module combining residual compensation with attention, and a reconstruction module.
I_LR and I_SR denote the input and output of the network, respectively. Shallow feature information is first extracted from the LR image using a standard convolution: F_0 = H_0(I_LR), where H_0(·) is the convolution operation for initial feature extraction and F_0 is the initially extracted feature map.
Second, the feature information F_0 is refined step by step by a chain of n end-to-end connected RCCN blocks. In this module, features are extracted by combining residual compensation with combined attention; residual compensation captures some of the features lost by the residual network and improves high-frequency reconstruction performance. The extraction process can be expressed as F_i = H_CRUB,i(F_{i-1}), i = 1, ..., n.
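The end-to-end chaining of the n blocks can be sketched as below; this is a toy illustration in which each callable stands in for one residual-compensation plus combined-attention block, not the paper's implementation.

```python
def nonlinear_mapping(f0, blocks):
    """Pass the shallow features through n end-to-end connected blocks;
    block i consumes the output of block i-1, so features are refined
    step by step: F_i = block_i(F_{i-1})."""
    f = f0
    for block in blocks:
        f = block(f)
    return f

# Toy example with scalar "features" and two stand-in blocks.
out = nonlinear_mapping(1.0, [lambda x: x + 1.0, lambda x: 2.0 * x])
```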

Residual compensation module
As shown in Figure 2, our residual compensation network uses group convolution and depthwise separable convolution as its basic units and introduces a residual structure into the forward propagation of features to prevent the effects of gradient explosion and gradient vanishing. We then introduce a channel shuffle operation, which reduces the negative effect of group convolution and depthwise separable convolution on inter-group information flow.
For the features input to the network, residual compensation is first performed in a low channel dimension to reduce the computational cost of the compensation process. A feature map of dimension q is split by a channel split function into q1-dimensional and q2-dimensional features, [F_q1, F_q2] = S(F_q), where dimension q1 equals dimension q2 and q1 + q2 = q. The purpose is to promote the fusion of information after feature mapping and to increase the richness of the extracted information. Of the two split feature streams, one is used for long-path information extraction and the other is retained after a single convolution. The long-path extraction starts with a group convolution with kernel size 1 followed by an activation function to extract shallow information. After a two-layer group convolution operation, and by incorporating a depthwise separable convolution, deeper detailed information can be extracted. When the image features F_i are passed into the first group convolution layer, the feature map F_KL is obtained. Although low-dimensional residual compensation keeps the convolution operations in a low-dimensional space, the reduced dimensionality means the network extracts incomplete feature information. To obtain fuller feature information, we introduce a two-way constant-dimensional residual compensation branch, shown on the right of the figure. In the first branch of the constant-dimensional residual compensation, we use the same convolution settings as in the long-path extraction part of the low-dimensional residual compensation, with the difference that the constant-dimensional branch does not compress the dimensions, i.e. no dimensional splitting of the input features is performed. In the second branch, we use only one group convolution operation and one depthwise separable convolution operation. To increase the exchange of information, both paths are followed by a channel shuffle operation, and the feature information from the different branches is then summed through an Add layer to fuse all the feature information.
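The two structural operations the module relies on, channel split and channel shuffle, are easy to state concretely. The following NumPy sketch illustrates only these tensor rearrangements under assumed (C, H, W) layout; the convolutions themselves are omitted.

```python
import numpy as np

def channel_split(feat):
    """Split a (q, H, W) feature map into two q/2-channel halves, mirroring
    the q -> (q1, q2) split used by the low-dimensional branch."""
    q1 = feat.shape[0] // 2
    return feat[:q1], feat[q1:]

def channel_shuffle(feat, groups):
    """Interleave channels across groups so that information produced by
    group convolutions can mix between the groups."""
    c, h, w = feat.shape
    return (feat.reshape(groups, c // groups, h, w)
                .transpose(1, 0, 2, 3)
                .reshape(c, h, w))

# Each channel is filled with its own index so the shuffle is easy to follow.
feat = np.arange(8, dtype=float)[:, None, None] * np.ones((8, 2, 2))
f1, f2 = channel_split(feat)          # channels 0-3 and 4-7
shuffled = channel_shuffle(feat, groups=2)
```

With two groups, channels [0,1,2,3,4,5,6,7] become [0,4,1,5,2,6,3,7], so each group's output is interleaved with the other's before the next grouped convolution.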

Combined Attention Module
In traditional convolutional neural networks, the features produced by all convolutional layers are aggregated directly into the output. This cannot make effective use of the useful information in feature extraction or focus on information helpful for recovering image detail, and it therefore limits how efficiently the network learns from feature information. For this reason, a combined attention module is designed in this paper; its structure is shown in Figure 3. In the figure, the input features pass in turn through global average pooling, a fully connected layer, an activation function, another fully connected layer and a Sigmoid function, which yields the weights of the input features. Since a single attention mechanism can only obtain weights along one dimension of the image features, we introduce pixel-level attention and propose a combined attention module that applies both channel-level and pixel-level attention to the input image. As shown in the second module of Figure 3, pixel-wise multiplication yields the pixel attention feature information.
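The two stages described above can be sketched as follows. This is a hedged NumPy illustration with random weights standing in for learned parameters; the fully connected layers are plain matrix products and the pixel branch's 1×1 convolution is a per-pixel channel mixing, which may differ in detail from the paper's exact layer settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_attention(feat, wc1, wc2, wp):
    """Combined attention sketch: channel attention re-weights whole channels,
    then pixel attention applies a full (C, H, W) sigmoid mask element-wise."""
    # Channel branch: global average pool -> FC -> ReLU -> FC -> Sigmoid.
    squeeze = feat.mean(axis=(1, 2))                       # (C,)
    cw = sigmoid(wc2 @ np.maximum(wc1 @ squeeze, 0.0))     # per-channel weights
    feat = feat * cw[:, None, None]
    # Pixel branch: 1x1 conv (channel mixing at each pixel) -> Sigmoid mask.
    mask = sigmoid(np.einsum('oc,chw->ohw', wp, feat))     # (C, H, W) mask
    return feat * mask

rng = np.random.default_rng(1)
C, r = 8, 4                                # channels and reduction ratio
feat = rng.standard_normal((C, 5, 5))
out = combined_attention(feat,
                         rng.standard_normal((C // r, C)),  # FC reduce
                         rng.standard_normal((C, C // r)),  # FC expand
                         rng.standard_normal((C, C)))       # 1x1 conv weights
```

Because both branches end in a sigmoid, every output value is the input scaled by factors in (0, 1): the channel factor is shared across a whole channel, while the pixel mask varies at every (channel, row, column) position.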

Experimental setup
To evaluate the effectiveness and accuracy of the proposed algorithm, we experimented on pairs of LR and HR images, using DIV2K [12] as the training dataset and augmenting it with 90°, 180° and 270° rotations and random horizontal flips. Four commonly used standard datasets, Set5 [13], Set14 [14], B100 [15] and Urban100 [16], were used as test datasets. All experiments in this paper operate on the three RGB channels, and evaluation is performed after converting the image colour space from RGB to YCbCr, on its Y channel.
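The Y-channel evaluation convention can be made concrete with a short sketch; the BT.601 luma coefficients below are the standard ones used in SR benchmarking, though the paper does not spell out its exact conversion.

```python
import numpy as np

def rgb_to_y(img):
    """Y (luma) channel of the BT.601 YCbCr transform for images in [0, 255];
    SR benchmarks conventionally report PSNR/SSIM on this channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Example: PSNR between a white patch and a slightly darkened copy.
hr = np.full((8, 8, 3), 255.0)
sr = hr - 1.0
score = psnr(rgb_to_y(hr), rgb_to_y(sr))
```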

Experimental results and analysis
To objectively evaluate the algorithm proposed in this paper, we selected Bicubic, SRCNN [5], FSRCNN [6], VDSR [17], DRCN [18], CARN [19], MSRN [20] and other algorithms for comparison. The results are shown in Table 1. At 2×, 3× and 4× magnification, the PSNR and SSIM values of the above algorithms are compared on the test datasets Set5, Set14, BSD100 and Urban100. As shown in Table 1, compared with the above algorithms, the proposed algorithm significantly improves PSNR and SSIM at 2×, 3× and 4× magnification, and the improvement is more obvious at 3× and 4×. Compared with the second-best algorithm, CARN, on the Set5, Set14, BSD100 and Urban100 test sets, the average PSNR of the proposed algorithm is improved by 0.19 dB, 0.08 dB, 0.06 dB and 0.28 dB respectively, and the SSIM is improved by 0.0023, 0.0010, 0.0020 and 0.0056 respectively. This indicates that RCCN performs better on every dataset.
As shown in Figure 4, in the upper outer column of Urban100 image024, the reconstruction produced by the proposed RCCN algorithm is clearer; compared with the other algorithms it shows no artifacts and is closer to the HR image. In Urban100 image076, the other algorithms distort and deform the transverse decoration of the window light, whereas the RCCN reconstruction shows no deformation, restores the exterior light decoration well, and recovers details more accurately than the other algorithms.

Conclusion
To solve the problems of insufficient extraction of residual information and insufficient use of feature information in residual networks, this paper proposes a new super-resolution reconstruction algorithm based on a residual compensation combined attention network. We designed a three-channel residual feature extraction module to extract more feature information: the main channel adopts a low-dimensional residual extraction method to extract information while reducing the number of parameters, and the two auxiliary channels adopt different convolution connections to extract features at different levels. At the same time, to pass the features extracted by the residual compensation path effectively and improve extraction efficiency, the combined attention method is used to further select the extracted residual information, increasing the efficiency of the network, allowing better passage of low-frequency information and enriching the edge information recovered by the network.

where F_i represents the input features divided into two paths, H_CRUB,i(·) denotes the i-th CRUB module for feature extraction, and B_C(·) denotes the end-to-end connection and fusion of the n modules above. This yields the deep feature map F_LD.

Figure 1 :
Figure 1: RCCN network structure diagram

Finally, the resulting deep image features are used as input for upsampling, by which the feature maps are scaled to the desired magnification, i.e. F_UP = H_UP(F_LD), where H_UP(·) denotes the upsampling operation and F_UP the upsampled features.
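The paper does not name its upsampling operator, so as an assumption the sketch below uses the sub-pixel (pixel shuffle) rearrangement common in recent SR reconstruction modules: a convolution produces scale² times as many channels, which are then rearranged into spatial resolution.

```python
import numpy as np

def pixel_shuffle(feat, scale):
    """Rearrange a (C*scale^2, H, W) feature map into (C, H*scale, W*scale):
    each group of scale^2 channels fills one scale x scale output patch."""
    c2, h, w = feat.shape
    c = c2 // (scale * scale)
    out = feat.reshape(c, scale, scale, h, w)
    out = out.transpose(0, 3, 1, 4, 2)   # (c, h, scale, w, scale)
    return out.reshape(c, h * scale, w * scale)

# 4 channels at 2x2 resolution -> 1 channel at 4x4 resolution (scale 2).
feat = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)
up = pixel_shuffle(feat, scale=2)
```

Unlike deconvolution, this rearrangement adds no parameters of its own; the learning happens in the convolution that expands the channel count beforehand.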

where F_KS is the short-path retained feature and F_KL is the feature information extracted by the first layer of long-path extraction, G_L(·) denotes the second-layer group convolution with activation function, C_DW(·) denotes the depthwise separable convolution along the channel dimension, CS(·) represents the channel shuffle operation, and G_E(·) represents the 1×1 dimension-preserving convolution.

Figure 2 :
Figure 2: Structure of the residual compensation module

Figure 3 :
Figure 3: Structure of the combined attention module

Figure 4 :
Figure 4: Visual quality of the Urban100 test set 4x magnification of RCCN compared to advanced methods

Table 1 :
Quantitative evaluation of 9 SR methods tested on four benchmark sets at different magnifications