Research on Residual Convolutional Neural Network for Handwritten Digit Recognition

Abstract: Handwritten digit recognition is widely applied and has considerable practical significance. However, the shapes of handwritten digits vary greatly, so accurate recognition depends on efficient and accurate recognition techniques. This article proposes a residual convolutional network model to address inaccurate feature extraction and weak generalization ability in convolutional neural networks. Introducing residual blocks into the network effectively alleviates the problem of vanishing and exploding gradients. Batch Normalization and Dropout layers are also introduced to accelerate training and reduce the risk of overfitting. Finally, k-fold cross validation is used to select the optimal hyperparameter configuration of the model. The experimental results show that the residual convolutional neural network achieves high recognition accuracy and strong generalization ability.


Introduction
Handwritten digit identification, categorization, and processing technology serves as the core of digital automation systems and will replace manual extraction of digital information in a variety of industries, considerably enhancing productivity. Research on handwritten digit recognition also makes it possible to evaluate and validate new hypotheses that have theoretical relevance for related challenges such as Chinese and English character recognition. Many recognition algorithms have been developed for handwritten numerals, including the BP neural network, the autoencoder network, and the convolutional neural network. Although recognition rates and model performance have gradually improved, several issues still limit accuracy: image noise introduces uncertainty; digit recognition, unlike text recognition, lacks context and must rely on the character itself, without the aid of other recognition techniques; and it remains difficult to strike a balance between speed and accuracy [1][2][3].
Zhou proposed extracting the curvature features of character contours and recognizing them with a "17/8/2" Back Propagation (BP) neural network. Because some digits have similar curvature (such as "0" and "8", or "2" and "4"), the recognition rate stays below 95% [4]. To address the long training time and local optima of BP neural networks, Wei Henghua and colleagues proposed an improved genetic algorithm and used it to optimize the weights and thresholds of artificial neural networks. Based on this algorithm, a "256/16/10" BP neural network was constructed and effectively trained on the USPS handwritten digit sample set [5]. Liu Yang and other researchers improved the BP algorithm with the variable step method and Newton's method, obtaining a convergence speed significantly faster than that of other improved algorithms, and applied the resulting BP recognition model to a digit recognition system. Although this method is faster, it requires more memory, and its recognition rate for handwritten digits is still not very high [6]. Yang et al. proposed an online incremental learning algorithm based on support vector machines. The method calls the LIBSVM classifier training and sample recognition functions and retrains the classifier with unrecognized samples as incremental data. The experimental results show that incremental training improves both training speed and recognition accuracy while accommodating new input samples [7]. Li Sifan and Gao Faqin proposed handwritten digit recognition based on convolutional neural networks. They built an improved convolutional neural network model and tested it on the MNIST character library. The results show a simple network topology, a reduced preprocessing workload, good scalability, fast recognition, and a high recognition rate. Recognition performance is noticeably better than that of the older methods, and the model successfully avoids overfitting [8].
In summary, whether BP neural networks or the support vector machine (SVM) algorithm is used, a single recognition model frequently struggles to achieve both speed and high accuracy in handwritten digit recognition. Traditional approaches still rely on hand-extracted sample features, and recognition algorithms based on statistical features struggle to distinguish characters with similar shapes. Feature extraction is therefore a primary area for improvement. In addition, the design and implementation of traditional artificial neural networks frequently rely too heavily on experience, and their generalization performance is hard to guarantee, so the output may not be optimal when they are applied to actual handwritten digit recognition [9]. Strengthening generalization capability and refining the network model are therefore also important areas of advancement.

Deep residual learning
Deep convolutional networks have made breakthrough progress in image classification tasks. Recent research has shown that the depth of a neural network has a crucial impact on classification performance, and the models that have achieved good results on the ImageNet dataset are all deep neural networks. As the depth of the network increases, however, vanishing gradients (gradient dispersion) become a problem. Normalized initialization solves this problem to some extent and allows networks with tens of stacked layers to be trained [10].
As shown in Figure 1, let the input be x and let the ideal mapping be f(x), which is fed to the activation function at the top of Figure 1. The portion in the dashed box only needs to fit the residual mapping f(x) - x relative to the identity mapping. The residual mapping is often easier to optimize in practice. Take the identity mapping as the ideal mapping f(x): simply setting the weight and bias parameters of the upper weighted operation (such as an affine transformation) in the dashed box of Figure 1 to 0 makes f(x) an identity mapping. In practice, when the ideal mapping f(x) is very close to the identity mapping, the residual mapping also readily captures small deviations from the identity. Figure 1 shows the basic block of ResNet, the residual block. In residual blocks, the input can propagate forward faster through the cross-layer data path. This shortcut connection neither introduces new parameters nor increases computational complexity [11].
(1) Batch normalization of fully connected layers
Usually, the batch normalization layer is placed between the affine transformation and the activation function of the fully connected layer. Let the input of the fully connected layer be $u$, the weight and bias parameters be $W$ and $b$, respectively, and the activation function be $\phi$. Denote the batch normalization operator by $\mathrm{BN}$. The output of the fully connected layer with batch normalization is then
$$\phi(\mathrm{BN}(x)),$$
where the batch normalization input $x$ is obtained by the affine transformation
$$x = Wu + b.$$
Consider a mini-batch of $m$ samples. The outputs of the affine transformation form a new mini-batch $\mathcal{B} = \{x^{(1)}, \ldots, x^{(m)}\}$, which is the input of the batch normalization layer. For any sample $x^{(i)} \in \mathbb{R}^d$, $1 \le i \le m$, in mini-batch $\mathcal{B}$, the output of the batch normalization layer is also a $d$-dimensional vector,
$$y^{(i)} = \mathrm{BN}(x^{(i)}),$$
obtained by the following steps. First, compute the mean and variance over mini-batch $\mathcal{B}$:
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu_{\mathcal{B}}\right)^2,$$
where the square is taken element-wise. Next, standardize $x^{(i)}$ using element-wise square root and element-wise division:
$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}},$$
where $\epsilon > 0$ is a very small constant that keeps the denominator greater than 0. On top of this standardization, the batch normalization layer introduces two learnable model parameters, the scale parameter $\gamma$ and the shift parameter $\beta$. Both have the same shape as $x^{(i)}$, that is, both are $d$-dimensional vectors. They are combined with $\hat{x}^{(i)}$ by element-wise multiplication (symbol $\odot$) and addition, respectively:
$$y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta.$$
This gives the batch-normalized output $y^{(i)}$. Note that the learnable scale and shift parameters preserve the possibility of not normalizing: with $\gamma = \sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}$ and $\beta = \mu_{\mathcal{B}}$, the layer returns $x^{(i)}$ unchanged.
(2) Batch normalization of convolutional layers
For a convolutional layer, batch normalization is applied after the convolution and before the activation function. If the convolution outputs multiple channels, batch normalization is performed on each channel separately, and each channel has its own scale and shift parameters, both of which are scalars. Let the mini-batch contain m samples and, on a single channel, let the height and width of the convolution output be p and q, respectively. Batch normalization is then applied to the m×p×q elements of this channel simultaneously, using a single mean and variance, namely those of the m×p×q elements in the channel.
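The batch normalization steps described above can be sketched in plain Python as follows. This is only an illustrative sketch, not the implementation used in the experiments: the function names `batch_norm` and `batch_norm_channel`, the default `eps=1e-5`, and the list-based data layout are all assumptions made for readability.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of m d-dimensional vectors feature by feature."""
    m, d = len(batch), len(batch[0])
    # Per-feature mean and (biased) variance over the m samples.
    mean = [sum(x[j] for x in batch) / m for j in range(d)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(d)]
    # Standardize, then apply the learnable scale (gamma) and shift (beta).
    return [
        [gamma[j] * (x[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(d)]
        for x in batch
    ]


def batch_norm_channel(feature_maps, gamma, beta, eps=1e-5):
    """Normalize one channel given m feature maps of size p x q.

    All m*p*q elements share one mean and variance; gamma and beta are scalars.
    """
    # Treat every element of the channel as a one-dimensional sample.
    flat = [[v] for fmap in feature_maps for row in fmap for v in row]
    out = iter(v[0] for v in batch_norm(flat, [gamma], [beta], eps))
    # Rebuild the original m x p x q nesting in the same traversal order.
    return [[[next(out) for _ in row] for row in fmap] for fmap in feature_maps]


# Three samples with two features each (illustrative values only).
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to the square root of the batch variance plus `eps` and `beta` set to the batch mean, `batch_norm` returns its input unchanged, which is the "possibility of not normalizing" mentioned in the text.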

Residual neural network
ResNet follows VGG's design of using 3×3 convolutional layers throughout. As shown in Figure 2, the residual block first has two 3×3 convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. A shortcut then skips these two convolution operations and adds the input directly before the final ReLU activation function. The residual block can be configured by the number of output channels, whether to use an additional 1×1 convolutional layer to change the number of channels, and the stride of the convolutional layers. In ResNet, the initial 7×7 convolutional layer with 64 output channels and a stride of 2 is followed by a 3×3 max pooling layer with a stride of 2. ResNet adds a batch normalization layer after each convolutional layer. ResNet then uses four modules composed of residual blocks, each module using several residual blocks with the same number of output channels. In its first residual block, each subsequent module doubles the number of channels of the previous module and halves the height and width [14].
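The data flow through a residual block can be illustrated with a deliberately simplified sketch. It is not the network used in this article: dense (affine) layers stand in for the 3×3 convolutions, batch normalization is omitted, and the names `residual_block` and `w_skip` are hypothetical. The optional `w_skip` projection plays the role of the 1×1 convolution that changes the number of channels.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def affine(weights, bias, v):
    """Dense layer W v + b, standing in for a convolution in this sketch."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2, w_skip=None):
    """Compute ReLU(F(x) + shortcut(x)) with F = affine -> ReLU -> affine."""
    hidden = relu(affine(w1, b1, x))
    residual = affine(w2, b2, hidden)
    # Without a projection the shortcut is the identity and adds no parameters.
    shortcut = x if w_skip is None else affine(w_skip, [0.0] * len(w_skip), x)
    return relu([r + s for r, s in zip(residual, shortcut)])


# With all weights and biases at zero, the block reduces to the identity
# mapping for non-negative inputs, because only the shortcut contributes.
x = [0.5, 1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]
y = residual_block(x, zeros, [0.0] * 3, zeros, [0.0] * 3)
```

The zero-weight case mirrors the argument that a residual block can represent the identity mapping simply by driving its weighted operations to zero.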

Handwritten Digit Prediction Experiment
First, all weight parameters are initialized with the Xavier method and all bias parameters are set to 0. The model is trained on the 60000 training samples of the MNIST dataset, and its parameters are adjusted through back-propagation. Afterwards, the training set is passed through the trained model and the classification results are observed. The RCN model used in this article is trained with epoch=10, a mini-batch size of 64 for gradient descent, a learning rate of 0.01, and a momentum parameter of 0.5. The results of the prediction experiment are shown in Table 1 and Figure 5. According to these results, the training error gradually decreases with the iteration cycle, with a slight rebound in the sixth cycle before it continues to decrease. The generalization error decreases significantly in the first three iteration cycles and then declines gently to about 5. The training recognition rate rises rapidly from 91% to over 97% in the first three iteration cycles, while the validation recognition rate rises slowly and remains above 97% from the beginning, indicating that the model has good generalization ability.
The training error and generalization error of the model decline steadily as the iteration cycle increases. After three iteration cycles, the training and test accuracy are both stable above 98%. The training process is relatively stable, with no significant jumps in error. This experiment verifies the effectiveness of the RCN model, which can reliably complete classification and prediction tasks.
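The training configuration described in this section (Xavier initialization, mini-batch size 64, learning rate 0.01, momentum 0.5) can be sketched as follows. The article does not state which variant of Xavier initialization or of the momentum update was used, so the uniform variant and the update rule v ← momentum·v + lr·g, w ← w − v are assumptions, as are the function names and the 784×10 layer shape used for the demonstration.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.5):
    """Update params in place: v <- momentum * v + lr * g, then w <- w - v."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] + lr * g
        params[i] -= velocity[i]


# Illustrative layer shape only: 784 inputs (28 x 28 pixels) to 10 classes.
weights = xavier_uniform(784, 10, random.Random(0))
biases = [0.0] * 10  # bias parameters start at zero

# Number of parameter updates per epoch for 60000 samples and batch size 64.
steps_per_epoch = math.ceil(60000 / 64)  # 938
```

With these settings, 10 epochs correspond to 9380 parameter updates, since the last mini-batch of each epoch is only partially filled.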

Sensitivity analysis of hyperparameters
As before, all weight parameters are initialized with the Xavier method, all bias parameters are set to 0, the model is trained on the 60000 training samples of the MNIST dataset, and its parameters are adjusted through back-propagation. Afterwards, the training set is passed through the trained model and the classification results are observed.
First, the number of iteration cycles is varied. As shown in Figure 6, when epoch=5 the validation error decreases rapidly in the first two cycles, rebounds in the third cycle, and decreases steadily thereafter. When epoch=10, as shown in Figure 7, the validation error decreases rapidly in the first three cycles, reaches a plateau in the third cycle, and then stabilizes. When epoch=15, as shown in Figure 8, the situation is similar to Figure 7. The experimental results show that, as the epoch setting changes, the training recognition rate finally reaches 99% and the validation recognition rate finally reaches 98%. The sensitivity to the number of iteration cycles is therefore low: increasing or decreasing it does not change the model performance significantly.
Secondly, the iteration period, learning rate, and momentum parameter are held constant, and the mini-batch size is set to 16, 64, 128, and 256 in turn. When batch_size=16, as shown in Figure 9, the validation error increases slightly in the third, fifth, seventh, and ninth cycles; the overall trend is downward, but training is not stable enough. When batch_size=64, as shown in Figure 10, the validation error decreases rapidly in the first three cycles, reaches a plateau in the third cycle, and then stabilizes. When batch_size=128, as shown in Figure 11, the training error is greater than the validation error. The training error is very large at the beginning and drops rapidly in the first two cycles, after which both errors continue to decline gently. The model shows excellent generalization performance. The generalization error eventually converges to about 5, but the training error is higher than with the previous hyperparameter choices. When batch_size=256, as shown in Figure 12, the situation is similar to Figure 11.
Then, the iteration period, mini-batch size, and momentum parameter are held constant, and the learning rate is set to 0.005, 0.01, 0.1, and 0.3 in turn. When learning_rate=0.005, as shown in Figure 13, the training error and validation error both decrease with the iteration cycle, but the training error is always high, the generalization error is far greater than the training error, and the model overfits to a certain degree. When learning_rate=0.01, as shown in Figure 14, the validation error decreases rapidly in the first three cycles, reaches a plateau in the third cycle, and then stabilizes. When learning_rate=0.1, as shown in Figure 15, the training error decreases rapidly in the first four cycles, followed by a brief plateau and then a slight increase. When learning_rate=0.3, as shown in Figure 16, both the training error and the generalization error are very large, the network diverges, and the classification task cannot be completed. The experimental results show that the performance of the model changes greatly with the learning rate. When the learning rate decreases, the model overfits, the generalization error is large, and the training and test recognition rates are low; when the learning rate increases, the generalization error begins to increase until the model diverges and the classification task cannot be carried out.
Finally, the iteration period, mini-batch size, and learning rate are held constant, and the momentum parameter m is set to 0.3, 0.5, and 0.7 in turn. When m=0.3, as shown in Figure 17, the validation error decreases rapidly in the first three cycles and then continues to decrease steadily, giving a stable training process. When m=0.5, as shown in Figure 18, the validation error decreases rapidly in the first three cycles, reaches a plateau in the third cycle, and then stabilizes. When m=0.7, as shown in Figure 19, the validation error decreases rapidly in the first two cycles, reaches a plateau in the third cycle, and then decreases steadily, with a slight rebound starting from the seventh cycle. The sensitivity to the momentum parameter is therefore low, and the model performance does not change significantly when it increases or decreases.
To sum up, hyperparameters such as the mini-batch size and the learning rate are more sensitive, while the iteration period and the momentum parameter are not sensitive to numerical changes.
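The one-at-a-time procedure used in this section, varying a single hyperparameter while the others stay fixed, can be sketched as a configuration generator. The baseline values are taken from the prediction experiment (epoch=10, batch size 64, learning rate 0.01, momentum 0.5); that these were the fixed values in every sweep is an assumption, and the names `BASELINE`, `SWEEPS`, and `one_at_a_time` are hypothetical.

```python
# Assumed fixed values for the hyperparameters that are not being varied.
BASELINE = {"epochs": 10, "batch_size": 64, "learning_rate": 0.01, "momentum": 0.5}

# Values tried for each hyperparameter in the sensitivity analysis.
SWEEPS = {
    "epochs": [5, 10, 15],
    "batch_size": [16, 64, 128, 256],
    "learning_rate": [0.005, 0.01, 0.1, 0.3],
    "momentum": [0.3, 0.5, 0.7],
}


def one_at_a_time(baseline, sweeps):
    """Return the distinct configurations that differ from baseline in at most one value."""
    configs = []
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline, **{name: value})
            # The baseline itself appears in every sweep; keep it only once.
            if config not in configs:
                configs.append(config)
    return configs


configs = one_at_a_time(BASELINE, SWEEPS)
print(len(configs))  # 11 distinct training runs
```

Each configuration would then be passed to a training routine, and the resulting error curves compared within each sweep.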

Hyperparameter cross validation experiment
The experiments above explored the sensitivity of the model hyperparameters; this section explores their optimal combination. Using k-fold cross validation, the training set is divided evenly into 10 mutually disjoint subsets, and different values are tried for the sensitive hyperparameters. This experiment evaluates combinations of mini-batch size and learning rate. From the cross validation results, we obtained the average training error and average generalization error (excluding singular values) for different mini-batch sizes, with learning rates ranging from 0.005 to 0.195 in steps of 0.005, as shown in Table 2. Figure 20 shows the training error and generalization error as functions of the learning rate when the mini-batch size is 256. The results indicate that selecting too small a batch (such as 32) easily causes overfitting, and both types of error are relatively high. When the mini-batch size is 160 or 256, the average generalization error rises slightly at some points but shows an overall downward trend. When the mini-batch size is 256, the generalization error is smaller than the training error, and the model exhibits very strong generalization performance. As the learning rate increases from 0.005 to 0.05, the training error decreases rapidly; there is a very small rebound at a learning rate of 0.025, after which the error continues to decline. Singular values appear at learning rates of 0.055 and 0.115. Based on these results, the hyperparameter combination with the lowest generalization error was selected, subject to a reasonable decrease in training error. The final choice is an iteration period of 10, a mini-batch size of 256 for gradient descent, a learning rate of 0.135, and a momentum parameter of 0.5.
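The search described above can be sketched in two steps: building the learning-rate grid and picking the combination with the lowest average generalization error. The article does not define how singular values were identified, so the rule used here (non-finite errors, or a validation error more than ten times the median) is purely an assumption, as are the names `learning_rate_grid` and `select_best` and the layout of `results`.

```python
import math
import statistics


def learning_rate_grid(start=0.005, stop=0.195, step=0.005):
    """Learning rates from start to stop inclusive, rounded to avoid float drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 3) for i in range(count)]


def select_best(results, outlier_factor=10.0):
    """Pick the (batch_size, learning_rate) key with the lowest validation error.

    results maps (batch_size, learning_rate) to (avg_train_error, avg_val_error).
    Entries with non-finite errors, or with a validation error above
    outlier_factor times the median, are treated as singular and skipped
    (an assumed rule, not one stated in the article).
    """
    finite = {k: v for k, v in results.items()
              if all(math.isfinite(e) for e in v)}
    cutoff = outlier_factor * statistics.median(v[1] for v in finite.values())
    kept = {k: v for k, v in finite.items() if v[1] <= cutoff}
    return min(kept, key=lambda k: kept[k][1])


grid = learning_rate_grid()
print(len(grid))  # 39 learning rates per mini-batch size
```

The selected learning rate of 0.135 lies on this grid. A stricter selection could additionally require the training error to fall below a threshold, matching the condition of "a reasonable decrease in training error".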

Conclusions
This article proposes a deep convolutional neural network model based on residual modules. It addresses the risk of overfitting caused by insufficient sample size when deep network models are applied to handwritten digit recognition, as well as the vanishing and exploding gradients caused by excessively deep networks. The experimental results show that the training recognition rate rises rapidly from 91% to over 97% in the first three iteration cycles, while the validation recognition rate rises slowly and remains above 97% from the beginning, indicating that the model has good generalization ability. The sensitivity experiment shows that hyperparameters such as the mini-batch size and the learning rate are more sensitive, while the iteration period and the momentum parameter are not sensitive to numerical changes. The hyperparameter cross validation experiment shows that selecting too small a batch easily causes overfitting and that the network is prone to divergence as the learning rate increases. By testing the sensitivity of each hyperparameter in the network model and selecting the optimal combination of hyperparameters, a deep convolutional residual network model with strong generalization ability and a high recognition rate can be trained.

Figure 1: The structure of the residual block
If batch normalization is not beneficial, the learned model can, in theory, avoid using batch normalization [13].

Figure 3: Partial samples from the MNIST dataset
Because the validation dataset does not participate in model training, reserving a large amount of validation data is wasteful when training data is scarce. One improvement is k-fold cross validation. In k-fold cross validation, the original training dataset is divided into k non-overlapping sub-datasets, and model training and validation are then performed k times. Each time, one sub-dataset is used to validate the model and the other k-1 sub-datasets are used to train it, so that a different sub-dataset is used for validation in each of the k rounds. Finally, the training error and validation error are averaged over the k rounds. We used the deep residual convolutional neural network shown in Figure 4 for this experiment.
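The k-fold procedure described above can be sketched with index lists. This is an illustrative sketch rather than the code used in the experiment: the names `k_fold_indices` and `cross_validate` are hypothetical, `train_and_evaluate` is a placeholder callback for the actual model training, and the samples are split in their existing order (in practice they would usually be shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(n_samples, k, train_and_evaluate):
    """Average the (train_error, val_error) pairs returned for each fold."""
    train_errors, val_errors = [], []
    for train, val in k_fold_indices(n_samples, k):
        train_error, val_error = train_and_evaluate(train, val)
        train_errors.append(train_error)
        val_errors.append(val_error)
    return sum(train_errors) / k, sum(val_errors) / k


# With 60000 MNIST training samples and k = 10, each round trains on
# 54000 samples and validates on the remaining 6000.
folds = list(k_fold_indices(60000, 10))
```

Every sample is used for validation exactly once across the k rounds, which is what makes the averaged validation error a less wasteful estimate than a single held-out split.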

Figure 4: The network model used in the experiment

Figure 11: Batch_Size=128
Figure 12: Batch_Size=256
With batch_size=256, the training error decreases more rapidly in the first two cycles and then declines gradually. The experimental results show that when the mini-batch size changes, the training and test recognition rates of the model on the MNIST dataset show almost no change.

Table 1: Prediction experiment results

Table 2: Cross validation experimental results