Emerging trends: Subwords, seriously?

  • Kenneth Ward Church

Abstract

Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information-theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, “electroneutral” can be parsed as electron-eu-tral or electro-neutral, and “bidirectional” can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. The prefix, bi-, has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
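The contrast between the two parses can be illustrated with a minimal sketch. It is not the BERT or ERNIE implementation: the dictionary `TOY_VOCAB` is a hypothetical toy built from the pieces named in the abstract, the function names are made up for illustration, and the greedy longest-match-first strategy is assumed here as a stand-in for how such tokenizers pick a parse. Continuation markers such as `##` are omitted for simplicity.

```python
from typing import List, Optional, Set

# Hypothetical toy dictionary built from the pieces named in the abstract.
# This is NOT the actual BERT or ERNIE vocabulary.
TOY_VOCAB: Set[str] = {
    "electron", "eu", "tral", "electro", "neutral",
    "bid", "ire", "ction", "al", "bi", "directional",
}


def greedy_longest_match(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Parse left to right, always taking the longest piece found in vocab.

    Assumed stand-in for a greedy tokenizer; returns None if no parse exists.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the dictionary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return None  # no dictionary piece starts at this position
        pieces.append(word[start:end])
        start = end
    return pieces


def fewest_pieces(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Dynamic program: best[i] is a parse of word[:i] with the fewest pieces."""
    n = len(word)
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


if __name__ == "__main__":
    for w in ("electroneutral", "bidirectional"):
        print(w, greedy_longest_match(w, TOY_VOCAB), fewest_pieces(w, TOY_VOCAB))
```

With this toy dictionary, the greedy parse commits to electron and bid because they are the longest initial matches, and is then forced into short leftover pieces; the dynamic program considers every split point and so recovers electro-neutral and bi-directional.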

Copyright

This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

Coker, C.H., Church, K.W. and Liberman, M.Y. (1991). Morphology and rhyming: Two powerful alternatives to letter-to-sound rules for speech synthesis. In The ESCA Workshop on Speech Synthesis, pp. 83–86.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186.
Schuster, M. and Nakajima, K. (2012). Japanese and Korean voice search. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5149–5152.
Sennrich, R., Haddow, B. and Birch, A. (2016). Neural machine translation of rare words with subword units. In Association for Computational Linguistics, pp. 1715–1725.
Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H. and Wang, H. (2019). ERNIE 2.0: A continual pre-training framework for language understanding. arXiv preprint, arXiv:1907.12412.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint, arXiv:1804.07461.
Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M. and Dean, J. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint, arXiv:1609.08144.
