How tokenizers, part-of-speech taggers, and algorithms can parse sentences for use in language learning apps
Natural Language Processing (NLP) is a field most often associated with tasks like automatically summarizing news articles, parsing search queries so computers can retrieve relevant results, and “chatbots” that take the place of humans for customer service.
I’m not interested in these tasks. I’ve always wanted to explore how NLP might be useful in helping people learn languages. NLP can be incredibly powerful, but that power is rarely put to use in language learning apps. As an example, NLP tools can parse sentences to generate syntax trees, label parts of speech, and identify noun and verb phrases. These features could automate things like word-order correction tasks or input enhancement. However, I’ve yet to see a straightforward, user-friendly app that uses NLP behind the scenes.
How should we chunk sentences?
My app, TextMix, does it too arbitrarily
After creating my automated sentence scramble app TextMix, I realized it may not be as grounded in TESOL theory as I’d hoped. Specifically, generating scrambled sentences by “chunking” them arbitrarily every 2 or 3 words and mixing up these chunks might make for a fun activity, but if there’s no logic to which words get “chunked” together, the learner isn’t really processing language in a useful way.
To illustrate, below is an example of how TextMix “chunks” a sentence when the “chunk size” is set to 3. As you can see, the only syntactically meaningful chunk is “the world” (a noun together with its determiner/article).
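The arbitrary chunking TextMix performs can be sketched in a few lines of plain Python (a minimal illustration of the idea, not TextMix’s actual implementation; the example sentence is the one discussed below):

```python
def chunk_arbitrarily(sentence, size=3):
    """Split a sentence into fixed-size word chunks, ignoring syntax."""
    words = sentence.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

chunks = chunk_arbitrarily("Tea is a drink that is popular all over the world")
print(chunks)
# ['Tea is a', 'drink that is', 'popular all over', 'the world']
```

Note that “the world” comes out intact only by luck: the split points fall wherever the word count dictates, with no regard for phrase boundaries.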
The other chunks don’t make sense on their own. “Tea is a” is not a meaningful chunk – not a noun phrase, verb phrase, prepositional phrase, or clause. These chunks don’t reflect how we actually parse sentences when we read; they aren’t how we “chunk.” Plenty of research shows that effective readers (and speakers) process language in chunks, set phrases, and collocations rather than in individual words.
Now, a much more natural way to chunk the above sentence would be as follows:
Keep in mind that this isn’t the only way we could chunk the above sentence. We could also group the noun phrase “a drink that is popular,” although if we’d like our student to practice assembling adjective clauses, then separating the “that is popular” might be a better choice. Similarly, because “all over the world” is a formulaic expression, we can keep it as its own chunk.
So we’ve established that grouping words into meaningful chunks is likely better for learners, as it reflects how we naturally parse language when reading or speaking. But TextMix is an app that gets text from Wikipedia, news articles, and any user-pasted text. It’s not possible to manually chunk each of these sentences, so we’ll need to automate this task with code. Specifically, we need an NLP tool to do the job. In the next section, I describe my quest for such a tool.
Using NLP tools to help chunk more naturally
There are a few popular NLP tools, mostly in Python. One is the Natural Language Toolkit (NLTK), which I found simple to use but hard to coax into accurately chunked sentences. While NLTK can tag parts of speech, it requires you to write your own regular expressions to parse the sentence into noun phrases, verb phrases, etc. After that, you’ll need to experiment to produce whichever other meaningful chunks you’re after. This was a bit too involved for my purposes. I was looking for something more out-of-the-box.
This brought me to SpaCy, another Python package. SpaCy can generate noun phrases out of the box with a few lines of code and no regular expressions (that I needed to write, anyway). In a way, SpaCy builds on work already done, which is exactly what I was after as someone who’s more interested in TESOL and applied EdTech than coding algorithms.
That said, SpaCy can’t generate everything we need to break sentences into meaningful chunks. That’s partly because chunking is relative; there are multiple ways to chunk a sentence. This means we’ll need to adjust SpaCy’s “pipeline” slightly to get the chunks we need. I will describe my adventures with this in another post, but for now, here is a chart showing the kinds of chunks SpaCy can parse out of the box and those I’ll still need to coax out of it:
| SpaCy can parse these chunks with just a few lines of code: | SpaCy needs help to parse these chunks: |
| --- | --- |
| noun phrases | verb phrases |
| determiner-noun pairs | prepositional phrases |
| proper nouns | clauses (dependent, independent, relative, etc.) |
| | formulaic sequences (idioms, collocations, etc.) |