Very High Resolution Image Classification by Fusing Deep Convolutional Neural Networks

Recently, deep convolutional neural networks (CNNs) have achieved remarkable results, whether used as feature extractors or as classifiers, in particular for the very high resolution (VHR) image classification task, which is a key problem in the remote sensing field. This work aims to improve VHR image classification accuracy by fusing two pre-trained deep convolutional neural network models. We propose to concatenate the features extracted from the last convolutional layer of each pre-trained network into a long feature vector, which is fed into a fully-connected layer, and then to fine-tune the resulting model for the classification of a VHR image dataset. The experimental results are promising: they show that the fusion of two deep CNNs achieves better classification accuracy than the individual CNN models on the same dataset.


Introduction
Technological advancements have enabled deep convolutional neural networks [1] to outperform other algorithms by a wide margin in classification tasks, especially in the very high resolution remotely sensed imagery field, and this despite the large number of complex land cover classes that makes efficient classification a challenging problem. A deep convolutional neural network contains a large number of parameters and therefore must be trained on very large datasets. Despite recent advances in capturing remote sensing imagery from satellites, drones and planes, which have made very high resolution images widely available, this field still lacks datasets large enough to train a deep CNN properly.
One approach to overcoming this problem is to use the CNN as a feature extractor: Razavian et al. [2] demonstrated that features extracted from a pre-trained CNN can serve as a generic image representation for most visual recognition tasks. For example, Hu et al. [3] investigated transferring the activations of pre-trained CNNs to the high resolution image classification task. They extracted features from the last convolutional layer and the fully-connected layers of a deep network pre-trained on ImageNet, then coded them with traditional feature encoding methods such as bag of words (BOW) [4], locality-constrained linear coding (LLC) [5], the Vector of Locally Aggregated Descriptors (VLAD) [6] and the Improved Fisher Kernel (IFK) [7] to generate a robust image representation, achieving remarkable accuracies. Another approach, proposed by Oquab et al. [8], is to fine-tune the CNN parameters on the target dataset after they have been learned on a large and diverse dataset such as the ImageNet database [9]; this approach is called transfer learning (TL), and the success of the parameter transfer depends on the similarity of the target dataset to ImageNet. Taking this into account, in previous work [10] we evaluated how well deep convolutional neural network features from the last fully-connected layer generalize to the classification of a small VHR imagery dataset by fine-tuning only the deeper layers of the CNN. The results showed better accuracy than using the CNNs as feature extractors.
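The idea of fine-tuning only the deeper layers can be sketched with per-layer learning-rate multipliers, where a multiplier of zero freezes a layer. The layer names, shapes and rates below are purely illustrative, not those of any actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy three-layer network; weights stand in for pre-trained parameters.
params = {"conv1": rng.normal(size=(4, 4)),
          "conv5": rng.normal(size=(4, 4)),
          "fc":    rng.normal(size=(4, 3))}

# Transfer learning: freeze early layers (multiplier 0),
# fine-tune deeper layers at a reduced or normal rate.
lr_mult = {"conv1": 0.0, "conv5": 0.1, "fc": 1.0}
base_lr = 0.01

def sgd_step(params, grads):
    """One SGD update scaled by each layer's learning-rate multiplier."""
    for name, g in grads.items():
        params[name] -= base_lr * lr_mult[name] * g

grads = {k: np.ones_like(v) for k, v in params.items()}
before = {k: v.copy() for k, v in params.items()}
sgd_step(params, grads)

assert np.allclose(params["conv1"], before["conv1"])  # frozen layer unchanged
assert not np.allclose(params["fc"], before["fc"])    # deeper layer updated
```

Frameworks such as Caffe expose this mechanism directly through per-layer learning-rate multipliers in the network definition.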
Focusing on further improving CNN efficiency and on overcoming the lack of a large dataset, several recent recognition techniques have proposed fusing two feature extractors. The first notable progress was made by Simonyan et al. [11], who proposed a two-stream network architecture designed to mimic the pathways of the human visual cortex for object and motion recognition. More recently, bilinear CNNs were proposed [12], using two CNNs pre-trained on the ImageNet dataset and fine-tuned on fine-grained image classification datasets. The outputs of the two CNNs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor.
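The bilinear pooling step can be sketched in a few lines; the map size and channel counts below are illustrative, assuming each stream's feature map has been reshaped to (locations × channels):

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature maps from two CNN streams over the same image,
# reshaped to (H*W locations, channels).
fa = rng.normal(size=(49, 8))   # stream A: 7x7 map, 8 channels
fb = rng.normal(size=(49, 16))  # stream B: 7x7 map, 16 channels

# Bilinear pooling: outer product of the two feature vectors at each
# location, summed (pooled) over all locations -> an 8x16 matrix.
bilinear = fa.T @ fb            # equals sum_l outer(fa[l], fb[l])
descriptor = bilinear.flatten() # final image descriptor

assert descriptor.shape == (8 * 16,)
```

The matrix product `fa.T @ fb` is exactly the sum of per-location outer products, which is why bilinear pooling is cheap to compute despite the quadratic descriptor size.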
In this paper, we propose fusing two pre-trained CNN models for VHR image classification. The outputs of the last convolutional layer of each network are concatenated into a feature vector, and the model is then fine-tuned for classification after adding fully-connected layers on top, as shown in Figure 1. The experiments are performed on the WHU-RS dataset [13]. The experimental results show that the fusion of two CNNs obtains better classification accuracy than the individual CNNs, whether fine-tuned or used as feature extractors, on the WHU-RS dataset.

Proposed Approach
The fusion illustrated above is inspired by the bilinear models proposed by Tenenbaum and Freeman [14] to separate the "content" and "style" factors of perceptual systems, and is motivated by the promising results obtained with bilinear CNNs applied to fine-grained categorization [12].
This work is based on two popular CNNs, CaffeNet [15] and GoogleNet [16], whose parameters were trained on 1,200,000 ImageNet images covering 1,000 object classes; both are used here as feature extractors. The first model is a replication of the AlexNet model [17]; it contains 5 convolutional layers and 3 fully-connected layers, with a minor difference in the order of the pooling and normalization layers (in CaffeNet, pooling is done before normalization). The second model is deeper; it contains an Inception module that enabled Google's team to go as deep as 22 layers while using far fewer parameters (12 times fewer) than the AlexNet model.
Figure 2: Examples of each category in the WHU-RS dataset [13].
Both networks are used as feature extractors: features are taken from the last convolutional layer of CaffeNet and from the output of the top inception module of GoogleNet. These features are concatenated into a long vector, which is fed into fully-connected layers followed by a randomly initialized k-way SoftMax layer, where k is the number of classes in the target dataset. Finally, we fine-tune the model for the classification of the WHU-RS dataset, which contains 50 satellite images of size 600×600 for each of its 19 classes, collected from Google Earth. The collection from satellite images of different resolutions and the variations in scale, illumination and viewpoint-dependent appearance in some categories make this dataset challenging [3], as displayed in Figure 2. The results are evaluated against those of [10], where the authors fine-tuned the CNN models, and of [3], where the CNNs are used as feature extractors; both works addressed the classification of the WHU-RS dataset.
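The fusion head described above can be sketched as follows, assuming pooled 256-d CaffeNet conv5 features and 1024-d features from GoogleNet's top inception module (the dimensions are illustrative) and the 19 WHU-RS classes:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical pooled features for a batch of 4 images:
# CaffeNet conv5 (256-d) and GoogleNet top inception module (1024-d).
caffe_feats  = rng.normal(size=(4, 256))
google_feats = rng.normal(size=(4, 1024))

# Fusion: concatenate the two streams into one long vector per image.
fused = np.concatenate([caffe_feats, google_feats], axis=1)

# Randomly initialized k-way classifier head (k = 19 WHU-RS classes).
k = 19
W = rng.normal(scale=0.01, size=(fused.shape[1], k))
b = np.zeros(k)

probs = softmax(fused @ W + b)
assert fused.shape == (4, 256 + 1024)
assert np.allclose(probs.sum(axis=1), 1.0)
```

During fine-tuning, the gradient of the classification loss would flow through this head back into both pre-trained networks.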

Experiments and Results
We evaluate our approach with different combinations of CaffeNet and GoogleNet: the first is initialized with two CaffeNet models, the second with a CaffeNet and a GoogleNet model, and the third with two GoogleNet models. Features are extracted by both networks and concatenated, then the entire model is fine-tuned for several epochs at a reduced learning rate while the SoftMax classification layer is trained at the normal rate, preserving the parameters of the pre-trained models and transferring their learning into our new model. The experiments were performed with the deep learning framework Caffe, developed by Yangqing Jia [18]. The training and testing splits follow 5-fold cross-validation.
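The 5-fold split over the 950 WHU-RS images (19 classes × 50 images each) can be sketched as follows; the random seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

# WHU-RS: 19 classes x 50 images = 950 samples.
labels = np.repeat(np.arange(19), 50)
idx = rng.permutation(len(labels))
folds = np.array_split(idx, 5)  # five disjoint folds of 190 images

# 5-fold cross-validation: each fold is held out once for testing,
# and the remaining four folds form the training set.
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert len(set(test_idx) & set(train_idx)) == 0  # no overlap
    assert len(test_idx) + len(train_idx) == 950
```

The reported accuracy would then be averaged over the five held-out folds.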
The results obtained for the three combinations of CaffeNet and GoogleNet, taking features from the last convolutional layer (C5) and the output of the top inception module (IM) respectively, are reported for the WHU-RS dataset in the following tables. They are compared to the results obtained on the same classification task by fine-tuning each CNN model [10] and by using the CaffeNet model as a feature extractor from the fully-connected and convolutional layers [3]. Tables 1 and 2 display the WHU-RS classification accuracies of our different fusion models compared with previous work on this dataset. "IM" and "C5" refer to concatenation with features extracted from the output of the top inception module of GoogleNet and from the last convolutional layer of CaffeNet, respectively. "FC" refers to a fully-connected layer. Thus, "Features coded with IFK method" refers to features extracted from the last convolutional layer of CaffeNet and coded with the Improved Fisher Kernel method. Bold values indicate the highest classification accuracies.
These results show that our fusion of CaffeNet and GoogleNet, without any preprocessing or data augmentation, improves WHU-RS classification accuracy by 2% to 3.5% compared with an individual GoogleNet with or without data augmentation. Even the fusion of two GoogleNet models obtains slightly higher accuracy (by 1%) than GoogleNet fine-tuned with data augmentation. The fusion of two CaffeNet models improves on the compared methods and reaches almost the same accuracy as the features coded with the IFK method. The fusion of CaffeNet and GoogleNet improves accuracy by 1.5% compared with CaffeNet fine-tuned with data augmentation, by 2% compared with CaffeNet as a feature extractor from the first FC layer, by 3% compared with the second FC layer, and by almost 1% compared with the convolutional-layer features coded with IFK.

Conclusions
In this work, we proposed the fusion of two pre-trained deep convolutional neural networks to improve the classification accuracy on the WHU-RS dataset. The results show that the fusion of two CaffeNet models (C5+C5) is the lowest-performing fusion model, with 97.31% accuracy, while fusing two GoogleNet models (IM+IM) performs slightly better, with 97.65%. The best result, 98.22% accuracy, is achieved by fusing the two different models, CaffeNet and GoogleNet (C5+IM). The fusion of two different models thus outperforms the individual models; this can be explained by the fact that each network architecture yields an image representation that differs slightly from one architecture to another, even when the networks are trained on the same dataset. Even the lowest-performing fusion of two identical networks (C5+C5, with 97.31%) reaches almost the same accuracy as the best-performing individual model used as a feature extractor and coded with the IFK method (97.43%).