YouGlish + n-grams = Targeted bottom-up listening practice

YouGlish is a great site that allows you to search for YouTube videos by their transcripts rather than only by title, category, or content tags. This means you can cycle through dozens of videos to listen to the pronunciation of a word or phrase. YouGlish loads each video so it starts at the point where the word/phrase is spoken.

It’s a great tool in itself, but it becomes even more powerful when paired with a list of common “chunks” of words that tend to be spoken quickly. These include phrases like the following:

what do you think of
most of the time
should have gotten
asked her to

So what’s worth practicing?

We can find which chunks are most worth listening to (for bottom-up listening practice) by analyzing the following:

Which chunks occur most frequently in spoken English? (We can look at chunks of 3 words, 4 words, 5 words, etc.)
Which chunks have syllable/sound patterns that are commonly reduced or linked in fast spoken English?

Once we determine the above, we can target phrases that learners will commonly encounter but are likely to mishear due to their reduced/linked nature. Phrases like those listed above.

Step 1: Finding the most common phrases

The wonderful site ngrams.info contains n-gram frequency data for the Corpus of Contemporary American English (COCA) and iWeb corpus. The COCA includes both spoken and written English, but requires payment for access to the n-grams, while the 1,000,000 most frequent n-grams of the iWeb corpus can be downloaded for free. Ideally, we’d aim to only gather data for spoken English, but for the time being I just used the iWeb corpus to test things out.

Looking at the frequencies for the most common n-grams, we find that the most common 5-grams (chunks of 5 words) are the following:

at the end of the
is one of the most
at the top of the
in the middle of the
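To pull a list like this out of a downloaded file, something along these lines should work. This is only a minimal sketch: I'm assuming each line holds a frequency count followed by the words, separated by tabs, so check the actual layout of the file you download. The counts in SAMPLE are made up for illustration, and load_ngrams and most_frequent are hypothetical helper names.

```python
import io

# Made-up counts in an ASSUMED layout: frequency, then one word per tab-separated column.
SAMPLE = (
    "20\tat\tthe\ttop\tof\tthe\n"
    "40\tat\tthe\tend\tof\tthe\n"
    "10\tin\tthe\tmiddle\tof\tthe\n"
    "30\tis\tone\tof\tthe\tmost\n"
)

def load_ngrams(handle):
    """Read (frequency, n-gram) pairs from a tab-separated file-like object."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, headers, or anything without a numeric count.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), " ".join(fields[1:]).lower()))
    return rows

def most_frequent(rows, limit):
    """Return the `limit` most frequent n-grams, most frequent first."""
    ranked = sorted(rows, key=lambda row: row[0], reverse=True)
    return [ngram for _, ngram in ranked[:limit]]

# With a real download, replace io.StringIO(SAMPLE) with open(path, encoding="utf-8").
ngrams_list = most_frequent(load_ngrams(io.StringIO(SAMPLE)), 3)
print(ngrams_list)
```

Sorting by the count rather than trusting file order keeps the result correct even if the download isn't already ranked.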

Now that we have some common patterns of formulaic language, we could copy/paste these into YouGlish and practice slowed-down, bottom-up listening. However, to be even more efficient, we can filter these to just get the phrases containing reduced or linked sounds, as explained below.

Step 2: Filtering by commonly-reduced words

Just because a phrase is common doesn’t mean that it’s difficult to hear in fast speech. Usually students have trouble listening when sounds are reduced, blended, or linked. Therefore, we’d like to filter the list of common phrases (n-grams) to only include those containing these sound patterns. We’ll filter in 2 stages: first by word, and next by sound combinations. First, which words are commonly reduced in English? Below is a partial list.

him and them often get reduced to ’em
the vowels in common short words (for, and, do, you, to, have) often get reduced to schwa (What are you going to do? -> Whət ə yə gənna do?) (should have -> should əv )
Hearing the schwa reduction can help distinguish can from can’t in American English (can reduces but can’t doesn’t) (can -> cən)
of often gets reduced to ə when followed by a consonant (a cup ə tea)

Using Python, we can filter the list of most frequent n-grams to only those containing these words:

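# ngrams_list is a list of the most frequent n-grams. It can be the 10 most frequent groups of 4 words, the 100 most frequent groups of 3 words, etc.

reduced_words = set('him them for and do you to have can of'.split())

def filter_by_reduced_words(ngrams_list):
    filtered_list = []
    for ngram in ngrams_list:
        # Compare whole words, so that "of" does not also match "often"
        if any(word in reduced_words for word in ngram.lower().split()):
            filtered_list.append(ngram)
    return filtered_list

filtered_list = filter_by_reduced_words(ngrams_list)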

We then get a list of only n-grams containing these words which are often reduced in fast spoken English.
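A quick check on a few made-up phrases shows why it matters whether we test for whole words or for substrings. In the sketch below (the sample phrases and function names are my own, just for illustration), a substring test also keeps phrases like "the doctor said", because "do" hides inside "doctor", while a whole-word test does not.

```python
REDUCED_WORDS = frozenset("him them for and do you to have can of".split())

def contains_reduced_word(ngram):
    """True if any whole word in the n-gram is on the reduced-word list."""
    return any(word in REDUCED_WORDS for word in ngram.lower().split())

def contains_reduced_substring(ngram):
    """Looser test: true if a reduced word appears anywhere, even inside another word."""
    return any(word in ngram.lower() for word in REDUCED_WORDS)

sample = [
    "what do you think of",
    "often candid",        # contains "of", "can", "and" only as substrings
    "most of the time",
    "the doctor said",     # contains "do" only as a substring
]

kept_whole = [n for n in sample if contains_reduced_word(n)]
kept_substring = [n for n in sample if contains_reduced_substring(n)]

print(kept_whole)
print(len(kept_substring))
```

For this purpose the whole-word test seems like the safer choice, since the goal is to find words that are actually reduced in speech.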

Step 3: Filtering by reduced consonant/vowel combinations

Below are some rules for sound changes that depend on vowels and consonant combinations.

any final consonant can get linked with an onset vowel in the next word (e.g. out of, as large as, can ask)
“t” and “d” followed by a vowel can get reduced to an alveolar flap (water, butter, what are, had it)
onset “h” becomes silent when preceded by a consonant (I think I like her. -> “li ker”)

We can filter our list of n-grams once more using regular expressions to represent the above rules:

import re

# Python's re module takes the bare pattern, without JavaScript-style /.../g delimiters.

# This will match rule #1 - final consonant + onset vowel:
pattern1 = r'(?<![aeiou])e* [aeiou]'

# This will match rule #2 - the alveolar flaps:
pattern2 = r'(t|d)e* [aeiou]'

# This will match rule #3 - final consonant + onset "h":
pattern3 = r'(?<![aeiou])e* h'

One thing to keep in mind is that with regex we’re matching letters, but what we really want to match is sounds. Thus, we need the “e*” in our regex patterns above to account for the silent “e” after consonants at the end of words like state, come, available. We could attempt to tweak the code to achieve perfect accuracy for all spelling rules, but for now this seems good enough.

Below, we can see an example of matches using the regex patterns described above. Notice that with the exception of an errant comma, it nicely picked up each of the 3 rules for sound changes that depend on vowels and consonant combinations:

(I tested this using regexr.com and a sample website blurb)
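The same kind of check can be sketched directly in Python. This is a rough illustration rather than a definitive implementation: the patterns are written without /.../g delimiters because Python's re module takes the bare pattern, the third pattern (consonant followed by an onset "h") is my assumption about what rule #3 should match, and the rule names and the matched_rules helper are made up for this example.

```python
import re

# Letter-based approximations of the three sound rules (assumed forms).
RULES = {
    "linking": r"(?<![aeiou])e* [aeiou]",   # final consonant + onset vowel
    "flap": r"(t|d)e* [aeiou]",             # t/d + onset vowel
    "h-dropping": r"(?<![aeiou])e* h",      # final consonant + onset "h"
}

def matched_rules(phrase):
    """Return the names of the rules whose pattern occurs in the phrase."""
    text = phrase.lower()
    return [name for name, pattern in RULES.items() if re.search(pattern, text)]

for phrase in ["out of", "what are", "like her", "asked her to", "see you"]:
    print(phrase, "->", matched_rules(phrase))
```

Because the lookbehind only excludes vowels, a space after punctuation such as a comma also counts as "not a vowel", which is why stray punctuation can show up among the matches.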

Running these same regex patterns to filter our list of common n-grams, we can produce a more finely-filtered list of only useful n-grams to practice bottom-up listening:

new_filtered_list = []

for ngram in filtered_list:
    # keep the n-gram if it contains at least one of our sound patterns
    if any(re.search(pattern, ngram) for pattern in (pattern1, pattern2, pattern3)):
        new_filtered_list.append(ngram)

Among the top one million most frequent 4-word phrases in English, this new_filtered_list will contain only those phrases with commonly-reduced or linked sounds! It’s therefore a good list to use with YouGlish in order to engage in targeted, bottom-up listening practice for identifying reduced sounds.
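To save some copying and pasting, each phrase can be turned into a YouGlish link. The sketch below assumes that YouGlish search pages follow the pattern https://youglish.com/pronounce/PHRASE/english/us, which is what I see in my browser but which is not a documented API and may change. The youglish_url helper and the sample list are hypothetical.

```python
from urllib.parse import quote

def youglish_url(phrase, accent="us"):
    """Build a YouGlish search link for a phrase (URL layout is an assumption)."""
    return f"https://youglish.com/pronounce/{quote(phrase)}/english/{accent}"

# Stand-in for the filtered list produced in Step 3.
sample_phrases = ["at the end of the", "most of the time"]

for phrase in sample_phrases:
    print(youglish_url(phrase))
```

Spaces are percent-encoded by quote, so multi-word chunks stay intact in the link.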

Have you tried YouGlish? Interested in finding frequently-spoken, commonly-reduced chunks of language? Need help following the above? Feel free to comment or contact me. Thanks!
