Tokenization

Splitting a document into its component words, or tokens.
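As a minimal sketch, word-level tokenization can be as simple as pulling out runs of word characters; the regular-expression rule here is just one illustrative choice, and real tokenizers handle punctuation, contractions, and more:

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of word characters as tokens.
    return re.findall(r"\w+", text.lower())

tokenize("Splitting a document into its component words.")
# ['splitting', 'a', 'document', 'into', 'its', 'component', 'words']
```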

Note

If a word is too long or too uncommon to appear in the tokenizer's vocabulary, the word itself may be split into smaller pieces (subwords). Take the word “supercalifragilisticexpialidocious” as an example. It could be split into “super”, “cali”, “fragilistic”, “expi”, “ali”, and “docious”.
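A minimal sketch of this idea, using a greedy longest-match rule over a toy vocabulary (the vocabulary and the matching rule are illustrative assumptions, not a specific tokenizer's algorithm):

```python
def split_word(word, vocab):
    """Greedily split a word into the longest pieces found in a known vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

vocab = {"super", "cali", "fragilistic", "expi", "ali", "docious"}
split_word("supercalifragilisticexpialidocious", vocab)
# ['super', 'cali', 'fragilistic', 'expi', 'ali', 'docious']
```

In practice, subword tokenizers such as BPE or WordPiece learn their vocabularies from data, so the exact pieces depend on the training corpus rather than a hand-picked list like the one above.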
