Steven Cao (UC Berkeley)

Date: Sep 25, 2020

Title and Abstract

Cross-lingual Embedding Alignment

Cross-lingual word embeddings are vector representations for words with the property that words with similar meanings across languages are given similar vectors (i.e., “aligned” across languages). In this talk, I'll present various methods for producing aligned embeddings. First, in the non-contextual case (where each word in the vocabulary is mapped to a single vector), I'll describe alignment algorithms based on bilingual dictionaries, whose ideas form the basis for recent successes in unsupervised machine translation. Next, I'll move on to the contextual case, where the vector representation for a word also depends on the sentence it appears in. For example, “close” in “Please close the door” and “The deadline is close” would be given the same non-contextual vector but different contextual vectors. In this setting, I'll first show that the multilingual BERT model implicitly learns aligned contextual representations as a result of pre-training on multilingual language modeling. I'll then show that this alignment explains the model's strong results in zero-shot cross-lingual transfer (the setting where we train on English for some task but test in other languages), and that a supervised alignment procedure can further improve transfer performance. Finally, I'll use this insight to draw an interesting connection between multilingual pre-training and the unsupervised alignment algorithms presented in the first half of the talk.
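
To make the dictionary-based alignment idea concrete, the sketch below shows one standard instance, orthogonal Procrustes, which learns a rotation mapping source-language vectors onto their dictionary translations. This is an illustrative assumption, not necessarily the exact method covered in the talk; the NumPy code, synthetic data, and function name are mine.

```python
# A minimal sketch of dictionary-based alignment via orthogonal Procrustes
# (an illustrative method, not necessarily the one presented in the talk).
import numpy as np

def procrustes_align(X, Y):
    """Find an orthogonal matrix W minimizing ||X @ W - Y||_F.

    X: (n, d) source-language vectors for dictionary word pairs
    Y: (n, d) target-language vectors for the same pairs
    """
    # Closed-form solution: W = U @ Vt, where X.T @ Y = U @ diag(S) @ Vt
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage with synthetic embeddings standing in for real ones.
rng = np.random.default_rng(0)
d = 300
src = rng.normal(size=(5000, d))             # e.g. "English" vectors
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
tgt = src @ W_true                           # e.g. "French" vectors (synthetic)

# Pretend the first 1000 pairs form our bilingual dictionary.
W = procrustes_align(src[:1000], tgt[:1000])
print(np.allclose(src @ W, tgt, atol=1e-5))  # True: the spaces are aligned
```

Restricting W to be orthogonal preserves distances between word vectors, which is why a map fit on a small dictionary can generalize to the rest of the vocabulary.

The contextual half of the abstract can be illustrated the same way: the “close” example corresponds to extracting per-token vectors from multilingual BERT. This sketch assumes the Hugging Face transformers and PyTorch packages, and that “close” is a single wordpiece in this model's vocabulary.

```python
# A minimal sketch of the contextual "close" example using multilingual BERT.
# Assumes the Hugging Face `transformers` and `torch` packages, and that
# "close" is a single wordpiece in this model's vocabulary.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def contextual_vector(sentence, word):
    """Return the final-layer vector for `word` in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = contextual_vector("Please close the door.", "close")
v2 = contextual_vector("The deadline is close.", "close")
# The two occurrences of "close" get different contextual vectors.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```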