Text Parsing
Finding words and phrases in text
Our decision to work with a relational rather than a document-based database gave us the flexibility to perform complex queries. Choosing PostgreSQL specifically brought another, unexpected benefit: built-in full text search.
On the front end, Alexandria highlights all the words and phrases a user has already encountered. To find out whether a text contains words or phrases that the user has already marked, the words of the text have to be compared to all the user’s words and phrases saved in the database.
While we were implementing a search algorithm using tries, a closer look at the PostgreSQL documentation revealed that the database provides full text search. Not only were the search functionalities just what we needed, but using them also offered a number of advantages over a solution of our own:
- Written in C, the database implementation to support search was likely faster than anything implemented in TypeScript.
- Since the data would be cleansed and filtered in the database, at the start of the chain, this was more efficient than transferring the entire data set and processing it on either the server or the client.
PostgreSQL text search works by parsing and normalizing a text into a text search vector of the tsvector data type. Search terms, on the other hand, are parsed and normalized into text search queries of the tsquery type. The parser breaks a string (text or search term) into individual tokens, mostly words in our case.
Normalization removes common small words from the tokens, and reduces words to their stems according to language-specific dictionaries. Heroku’s PostgreSQL installation came with support for twenty languages.
While this normalization step is powerful, its default setting is geared towards full text search, returning results including “handling” and “handlebar” when the original search term is “handle”. For a language learner, this was too lenient, as “handle” and “handling” are considered two distinct words. Luckily for our purposes, PostgreSQL also ships with a configuration called simple that parses the text, but leaves the tokens as they are. This is the configuration Alexandria uses.
To save processing time, each text is parsed immediately into a tsvector when added to the database (or modified later). The vector is saved in a column of the texts table defined like this:
ALTER TABLE texts
ADD COLUMN tsvector_simple tsvector
GENERATED ALWAYS AS (to_tsvector('simple', title || ' ' || body)) STORED;
In a similar fashion, every word or phrase added to the database also gets a column in which the parsed tsquery is saved.
We find all the words in text 5 that user 1 has in their vocabulary with this SQL query:
SELECT w.id, w.word, uw.word_status
FROM words AS w
JOIN users_words AS uw ON w.id = uw.word_id
WHERE uw.user_id = 1
AND
w.language_id = (SELECT t.language_id FROM texts AS t
WHERE t.id = 5)
AND
w.tsquery_simple @@ (SELECT t.tsvector_simple FROM texts AS t
WHERE t.id = 5);
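In application code the user and text ids would not be hard-coded. The following is a minimal sketch of how the same query could be parameterized, assuming a client such as node-postgres that accepts an object with text and values and uses PostgreSQL's $1, $2 placeholders. The function name userWordsInTextQuery and the QueryConfig interface are hypothetical and not taken from the Alexandria code base.

```typescript
// Shape of a parameterized query; node-postgres accepts objects like this.
// (Assumption: the project may use a different client or query helper.)
interface QueryConfig {
  text: string;
  values: number[];
}

// Hypothetical helper: builds the vocabulary lookup for one user and one text.
function userWordsInTextQuery(userId: number, textId: number): QueryConfig {
  const text = `
    SELECT w.id, w.word, uw.word_status
    FROM words AS w
    JOIN users_words AS uw ON w.id = uw.word_id
    WHERE uw.user_id = $1
      AND w.language_id = (SELECT t.language_id FROM texts AS t WHERE t.id = $2)
      AND w.tsquery_simple @@ (SELECT t.tsvector_simple FROM texts AS t WHERE t.id = $2)`;
  // $1 is bound to the user id, $2 to the text id.
  return { text, values: [userId, textId] };
}
```

Passing the ids as bound values rather than concatenating them into the SQL string keeps the query safe from injection and lets the same statement be reused for every user and text.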
Parsing text into React and HTML
With PostgreSQL doing the matching, a user’s text and the relevant vocabulary list make their way to the client. The front-end application takes over at this stage and presents these two sets of data, satisfying the following requirements:
- Every occurrence of a word from the list must be highlighted in the text according to its level of familiarity.
- Every occurrence of a phrase must be highlighted in the text according to its level of familiarity.
- Each word from the text must be clickable, regardless of whether a word is highlighted, or whether it is part of a phrase.
- Each phrase must be clickable/selectable.
- Punctuation and spaces must not be part of any word (with the exception of the apostrophe and hyphen).
- The text must be split into sentences to facilitate saving the context of a word when adding a new translation.
- Bonus: Phrases in the userWords list should be highlighted if found in the text, regardless of whether the phrase in the list or the phrase in the text contains punctuation.
Due to the way the browser DOM handles click events, it was clear that words would end up nested in <span>s. A phrase meant another set of <span>s around the word <span>s. Splitting the text at whitespace characters did not suffice, since punctuation stuck to the preceding word. We approached the problem with regular expressions.
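To illustrate why splitting at whitespace falls short, here is a small sketch. The sample sentence, the savedWords array, and the helper names are made up for illustration and are not part of the Alexandria code.

```typescript
// Hypothetical vocabulary entries, stored without punctuation.
const savedWords: string[] = ["course", "early"];

// Naive approach: split only at whitespace.
function splitAtWhitespace(text: string): string[] {
  return text.split(/\s+/).filter((part) => part.length > 0);
}

// A token counts as known only if it equals a saved word exactly.
function isSaved(token: string): boolean {
  return savedWords.indexOf(token.toLowerCase()) !== -1;
}

const naiveTokens = splitAtWhitespace("Of course, I get up early.");
// naiveTokens: ["Of", "course,", "I", "get", "up", "early."]
// "course," and "early." still carry punctuation, so isSaved() is false for both.
```

The comma and the full stop end up inside the tokens, so neither token matches its vocabulary entry, and the punctuation would also become part of the clickable word.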
Zooming in from the full text body down to individual words, the text is parsed and split up into React components and HTML elements in these steps:
1. The text body is divided into Paragraph components, using the split string method.
2. Paragraphs are split into Sentence components using this regular expression:
   /[^\s]([^!?.:;]|\.{3})*["!?.:;\s]*/g
   This expression splits sentences at ! ? . : ; but not at an ellipsis (...).
3. Sentences are split into tokens that are either phrases and words, or spaces and punctuation. The corresponding regular expression matches 1) all phrases found on the list, 2) any words, 3) non-words (i.e. spaces and punctuation). Here is an example with two phrases:
   /of course|get up|(?<words>[\p{L}\p{M}'-]+)|(?<nowords>[^\p{L}\p{M}'-]+)/gui
   The Unicode property escape \p is used with the categories Letter (L) and Mark (M) to account for letters beyond standard ASCII characters.
4. To satisfy the bonus criterion of finding phrases with any punctuation in them, the phrases at the beginning need to be replaced with:
   /of[^\p{Letter}\p{Mark}'-]+course|get[^\p{Letter}\p{Mark}'-]+up/
   to allow punctuation or spaces between the words that make up the phrase. Additionally, in preparation for parsing all phrases from the user's vocabulary, punctuation is stripped off of words.
5. If a token is a phrase (simply checking for spaces suffices), it is handed over to a Phrase component. Words go into Word components, and punctuation and spaces are wrapped into <span>s. That is necessary for selecting phrases from the text (see next section).
6. In a Phrase component, a phrase is wrapped into <span>s, and the parsing process from steps 3 and 5 is repeated without phrases.
7. Finally, the Word component wraps <span>s around a word.
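The paragraph and sentence splitting described above can be sketched as follows. The sentence expression is the one quoted in the text. The paragraph separator (one or more line breaks) and the function names are assumptions, since the text only says that the split string method is used.

```typescript
// Sentence expression as quoted in the text.
const SENTENCE = /[^\s]([^!?.:;]|\.{3})*["!?.:;\s]*/g;

// Assumption: paragraphs are separated by one or more line breaks.
function splitParagraphs(body: string): string[] {
  return body.split(/(?:\r?\n)+/).filter((p) => p.trim().length > 0);
}

// Returns each sentence together with its trailing punctuation and spaces,
// so that joining the results reproduces the paragraph.
function splitSentences(paragraph: string): string[] {
  return paragraph.match(SENTENCE) || [];
}
```

For example, splitSentences("Hello world. How are you? Wait... what!") yields "Hello world. ", "How are you? " and "Wait... what!": the ellipsis does not end a sentence, while the single full stop and the question mark do.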
The CSS for highlighting the words is applied in steps 6 and 7.
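The tokenizing steps, including the bonus criterion, could look roughly like the sketch below. It builds the expression from a list of phrases instead of hard-coding them. All names (WORD_CHARS, phrasePattern, buildTokenRegExp, tokenize) are hypothetical, and the negative lookahead after the phrase alternatives is an addition of this sketch that keeps a phrase such as "of course" from matching the start of "of courses". Tokens are classified by capture group here rather than by checking for spaces.

```typescript
// Characters allowed inside a word: letters, marks, apostrophe, hyphen.
const WORD_CHARS = "\\p{L}\\p{M}'-";

type TokenKind = "phrase" | "word" | "other";
interface Token {
  kind: TokenKind;
  text: string;
}

// Turns a vocabulary phrase into a pattern that tolerates any punctuation
// or spaces between its words. Punctuation in the phrase itself is dropped.
// The extracted words contain no regex metacharacters, so no escaping is needed.
function phrasePattern(phrase: string): string | null {
  const words = phrase.match(new RegExp(`[${WORD_CHARS}]+`, "gu"));
  if (!words || words.length < 2) return null; // single words are not phrases
  return words.join(`[^${WORD_CHARS}]+`);
}

function buildTokenRegExp(phrases: string[]): RegExp {
  const alternatives = phrases
    .map((p) => phrasePattern(p))
    .filter((p): p is string => p !== null)
    .sort((a, b) => b.length - a.length); // try longer phrases first
  const parts: string[] = [];
  if (alternatives.length > 0) {
    // Addition in this sketch: a phrase must not be followed by a word character.
    parts.push(`(?:${alternatives.join("|")})(?![${WORD_CHARS}])`);
  }
  parts.push(`(?<words>[${WORD_CHARS}]+)`);
  parts.push(`(?<nowords>[^${WORD_CHARS}]+)`);
  return new RegExp(parts.join("|"), "gui");
}

function tokenize(sentence: string, phrases: string[]): Token[] {
  const re = buildTokenRegExp(phrases);
  const tokens: Token[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(sentence)) !== null) {
    // Group 1 is `words`, group 2 is `nowords`; if neither took part
    // in the match, one of the phrase alternatives matched.
    const kind: TokenKind =
      m[1] !== undefined ? "word" : m[2] !== undefined ? "other" : "phrase";
    tokens.push({ kind, text: m[0] });
  }
  return tokens;
}
```

With the phrases "of course" and "get up", tokenize("Of course, I get up early.", ...) produces the tokens "Of course", ", ", "I", " ", "get up", " ", "early" and ".", of which the first and fifth are phrases. A Phrase component could then call tokenize again on its own text with an empty phrase list to obtain the inner words.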