Text Parsing

Finding words and phrases in text

Our decision to work with a relational rather than a document-based database gave us the flexibility to perform complex queries. Choosing PostgreSQL specifically brought another, unexpected benefit: built-in full text search.

On the front end, Alexandria highlights all the words and phrases a user has already encountered. To find out whether a text contains words or phrases that the user has already marked, the words of the text have to be compared to all the user’s words and phrases saved in the database.

While we were implementing a search algorithm using tries, a closer look at the PostgreSQL documentation revealed that the database provides full text search. Not only were its search functions just what we needed, but using them also offered a number of advantages over a solution of our own:

  • Written in C, the database's search implementation was likely faster than anything we could implement in TypeScript.
  • Since the data would be filtered in the database, at the start of the chain, this was more efficient than transferring the entire data set and processing it on either the server or the client.

PostgreSQL text search works by parsing and normalizing a text into a text search vector of the tsvector data type. Search terms, on the other hand, are parsed and normalized into text search queries of the tsquery type. The parser breaks a string (text or search term) into individual tokens, mostly words in our case.

Normalization removes common small words (stop words) from the tokens and reduces the remaining words to their stems according to language-specific dictionaries. Heroku’s PostgreSQL installation came with support for twenty languages.

While this normalization step is powerful, its default setting is geared toward full text search, returning results including “handling” and “handled” when the original search term is “handle”. For a language learner, this is too lenient, as “handle” and “handling” are considered two distinct words. Luckily for our purposes, PostgreSQL also ships with a simple configuration that parses the text but leaves the tokens as they are, apart from converting them to lowercase. This is the configuration Alexandria uses.

To save processing time, each text is parsed immediately into a tsvector when added to the database (or modified later). The vector is saved in a column of the texts table defined like this:

        ALTER TABLE texts
          ADD COLUMN tsvector_simple tsvector
              GENERATED ALWAYS AS (to_tsvector('simple', title || ' ' || body)) STORED;

In a similar fashion, the words table has a column in which the parsed tsquery of every word or phrase is saved when it is added to the database.

We find all the words in text 5 that user 1 has in their vocabulary with this SQL query:

  SELECT w.id, w.word, uw.word_status
    FROM words AS w
    JOIN users_words AS uw ON w.id = uw.word_id
   WHERE uw.user_id = 1
     AND w.language_id = (SELECT t.language_id FROM texts AS t
                           WHERE t.id = 5)
     AND w.tsquery_simple @@ (SELECT t.tsvector_simple FROM texts AS t
                               WHERE t.id = 5);
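
In application code, the literal IDs would normally be passed as parameters rather than written into the SQL. The following is a minimal sketch under that assumption, not Alexandria's actual code: buildUserWordsQuery is a hypothetical helper, and the { text, values } object mirrors the query config shape that node-postgres accepts. No database driver is imported or called.

```typescript
// Hypothetical helper (not from the original project): returns the query
// above with $1 (user id) and $2 (text id) placeholders instead of literals.
interface ParameterizedQuery {
  text: string;
  values: number[];
}

function buildUserWordsQuery(userId: number, textId: number): ParameterizedQuery {
  const text = `
    SELECT w.id, w.word, uw.word_status
      FROM words AS w
      JOIN users_words AS uw ON w.id = uw.word_id
     WHERE uw.user_id = $1
       AND w.language_id = (SELECT t.language_id FROM texts AS t
                             WHERE t.id = $2)
       AND w.tsquery_simple @@ (SELECT t.tsvector_simple FROM texts AS t
                                 WHERE t.id = $2)`;
  return { text, values: [userId, textId] };
}

// Usage: the words of user 1 that occur in text 5.
const userWordsQuery = buildUserWordsQuery(1, 5);
```

Keeping the IDs out of the SQL string lets the database driver handle quoting and avoids assembling queries from user input.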

Parsing text into React and HTML

With PostgreSQL doing the matching, a user’s text and the vocabulary found in it make their way to the client. The front-end application takes over at this stage to present these two sets of data, satisfying the following requirements:

  • Every occurrence of a word from the list must be highlighted in the text according to its level of familiarity.
  • Every occurrence of a phrase from the list must be highlighted in the text according to its level of familiarity.
  • Each word from the text must be clickable, regardless of whether a word is highlighted, or whether it is part of a phrase.
  • Each phrase must be clickable/selectable.
  • Punctuation and spaces must not be part of any word (with the exception of the apostrophe and hyphen).
  • The text must be split into sentences to facilitate saving the context of a word when adding a new translation.
  • Bonus: Phrases in the userWords list should be highlighted if found in the text, regardless of whether the phrase in the list or the phrase in the text contains punctuation.

Due to the way the browser DOM handles click events, it was clear that words would end up nested in <span>s. A phrase meant another <span> around its word <span>s. Splitting the text at whitespace characters did not suffice, since punctuation stayed attached to the preceding word. We approached the problem with regular expressions.
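
A short illustration of the difference, using a made-up sample sentence. The character classes match the ones described in the steps that follow; the RegExp constructor is used here only to keep the sketch independent of compiler settings.

```typescript
const sample = "Well, of course!";

// Splitting at whitespace leaves punctuation glued to the preceding word.
const bySpace: string[] = sample.split(/\s+/);
// → ["Well,", "of", "course!"]

// Matching runs of word characters (letters, marks, apostrophe, hyphen)
// or runs of everything else keeps words and punctuation apart.
const wordOrNonWord = new RegExp("[\\p{L}\\p{M}'-]+|[^\\p{L}\\p{M}'-]+", "gu");
const byPattern: string[] = sample.match(wordOrNonWord) || [];
// → ["Well", ", ", "of", " ", "course", "!"]
```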

Zooming in from the full text body down to individual words, text is parsed and split up into React components and HTML elements in these steps.

  1. The text body is divided into Paragraph components, using the string split method.

  2. Paragraphs are split into Sentence components using a regular expression.

    The expression splits sentences at ! ? . : ; but not at an ellipsis (...).

  3. Sentences are split into tokens that are either phrases and words, or spaces and punctuation.

    The corresponding regular expression matches 1) all phrases found on the list, 2) any words, 3) non-words (i.e. spaces and punctuation).

    Here is an example with two phrases:

    /of course|get up|(?<words>[\p{L}\p{M}'-]+)|(?<nowords>[^\p{L}\p{M}'-]+)/gui

    Unicode property escapes (\p{...}) with the categories Letter (L) and Mark (M) account for letters beyond the standard ASCII characters. They require the u flag.

  4. To satisfy the bonus criterion of finding phrases regardless of the punctuation in them, the phrases at the beginning of the expression need to be replaced with a pattern that allows punctuation or spaces between the words that make up the phrase.

    Additionally, in preparation for parsing all phrases from the user's vocabulary, punctuation is stripped from their words.

  5. If a token is a phrase (simply checking for spaces suffices), it is handed over to a Phrase component. Words go into Word components, and punctuation and spaces are wrapped in <span>s. Wrapping them is necessary for selecting phrases from the text (see next section).

  6. In a Phrase component, a phrase is wrapped into <span>s, and the parsing process from steps 3 and 5 is repeated without phrases.

  7. Finally, the Word component wraps <span>s around a word.
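
Steps 1 to 3 can be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: the function names are hypothetical, paragraphs are assumed to be separated by line breaks, and the sentence pattern is one possible expression with the behavior described in step 2. The React components are left out; the functions only return strings and tokens.

```typescript
type TokenKind = "phrase" | "word" | "nonword";

interface Token {
  kind: TokenKind;
  value: string;
}

// Step 1 (assumed delimiter): paragraphs are separated by line breaks.
function splitParagraphs(body: string): string[] {
  return body.split(/\n+/).filter((p) => p.trim().length > 0);
}

// Step 2: a sentence ends at a run of ! ? : ; or at a single period.
// Two or more periods in a row (an ellipsis) do not end a sentence.
// Trailing whitespace stays with the sentence it follows.
function splitSentences(paragraph: string): string[] {
  const sentence = new RegExp(
    "(?:[^.!?:;]|\\.{2,})+(?:[!?:;]+|\\.(?!\\.))?\\s*",
    "g"
  );
  return paragraph.match(sentence) || [];
}

// Escapes regular expression syntax characters in a phrase.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Step 3: phrases come first in the alternation, followed by a capture
// group for words and one for non-words.
function tokenize(sentence: string, phrases: string[]): Token[] {
  const alternatives = phrases
    .filter((p) => p.length > 0)
    .map(escapeRegExp)
    .concat(["([\\p{L}\\p{M}'-]+)", "([^\\p{L}\\p{M}'-]+)"]);
  const pattern = new RegExp(alternatives.join("|"), "gui");

  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(sentence)) !== null) {
    let kind: TokenKind = "phrase"; // no capture group matched
    if (m[1] !== undefined) {
      kind = "word";
    } else if (m[2] !== undefined) {
      kind = "nonword";
    }
    tokens.push({ kind, value: m[0] });
  }
  return tokens;
}

const sentences = splitSentences("Wait... what? Fine.");
// → ["Wait... what? ", "Fine."]
const tokens = tokenize("Of course, I get up!", ["of course", "get up"]);
// kinds: phrase, nonword, word, nonword, phrase, nonword
```

Because every character of a sentence ends up in exactly one token, joining the token values reproduces the sentence, which keeps spacing and punctuation intact when the tokens are rendered.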

The CSS for highlighting the words is applied in steps 6 and 7.
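
For the bonus criterion in step 4, one way to build a punctuation-tolerant pattern from a saved phrase is sketched below. The helper name phraseToPattern is hypothetical, and the exact pattern used in Alexandria is not shown here; this version keeps only the words of a phrase and allows any run of non-word characters between them.

```typescript
// Hypothetical helper: turns a saved phrase into a pattern source that
// tolerates punctuation or spaces between its words.
function phraseToPattern(phrase: string): string {
  // Keep only the words; punctuation in the saved phrase is dropped.
  const words: string[] = phrase.match(new RegExp("[\\p{L}\\p{M}'-]+", "gu")) || [];
  // The words consist only of letters, marks, apostrophes and hyphens,
  // so they need no escaping before being joined.
  return words.join("[^\\p{L}\\p{M}'-]+");
}

const ofCourse = new RegExp(phraseToPattern("of course"), "gui");
const found: string[] = "Of, course not".match(ofCourse) || [];
// → ["Of, course"]
```

A phrase saved with punctuation, such as "well, done", yields the same pattern as "well done", so both the list entry and the text may contain punctuation.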