Open Access

Study of Unsupervised Learning of Visual Objects with Spoken Words based on Siamese Network


DOI: 10.23977/cii2019.43

Author(s)

Hanjuan Zhang

Corresponding Author

Hanjuan Zhang

ABSTRACT

Humans learn to understand spoken language and to recognize images largely without supervision. It should likewise be possible for machines to jointly learn spoken language and visual objects. By removing the fully connected layer and experimenting with effective networks such as VGG19 and ResNet, existing neural networks that operate directly on image pixels and the audio waveform can be improved. My models do not rely on any labels or any other form of common supervision. The best model, trained on parallel speech and images, achieves a precision of over 70% on its top ten retrievals and almost 50% on its top five retrievals. The models are trained on the Flickr8k dataset and on an audio dataset converted from the Flickr8k captions.
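The precision-at-k retrieval metric reported in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the two Siamese branches produce L2-normalized image and audio embeddings in a shared space, where pair i of the audio set matches pair i of the image set, and all names and the toy data are hypothetical.

```python
import numpy as np

def precision_at_k(image_emb, audio_emb, k):
    """Fraction of audio queries whose paired image appears among the
    k most similar images (cosine similarity via dot product, since
    the embeddings are assumed L2-normalized)."""
    sims = audio_emb @ image_emb.T            # (N, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k best images per query
    hits = np.any(topk == np.arange(len(sims))[:, None], axis=1)
    return hits.mean()

# Toy embeddings standing in for Siamese-network outputs: matched
# image/audio pairs are made similar by construction.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 8))
img = base + 0.1 * rng.normal(size=base.shape)
aud = base + 0.1 * rng.normal(size=base.shape)
img /= np.linalg.norm(img, axis=1, keepdims=True)
aud /= np.linalg.norm(aud, axis=1, keepdims=True)

print(precision_at_k(img, aud, k=5))
```

On real data, the same function would be applied to the embeddings produced by the image and audio branches after training on parallel speech-image pairs.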

KEYWORDS

Speech and image, convolutional networks, multimodal learning, deep learning

All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.