Applications of Python to linguistics
An example text file
“Adam Bede” was written in 1859. The file (1179 KB) can be downloaded from »https://www.gutenberg.org/files/507/507-0.txt« and saved as »507-0.txt« (in the same folder as the Python script).
Scrubbing
In “Data Jujitsu,” DJ Patil states that “80% of the work in any data project is in cleaning the data” (2012). (quoted from “Data Science at the Command Line” by Jeroen Janssens [2014-09-23])
The file »507-0.txt« contains some lines at the beginning and at the end that are not part of “Adam Bede” and should be removed manually; automatic identification and removal of front and back matter is beyond the scope of this course. Afterwards the file should look as follows.
- Adam Bede (beginning)
ADAM BEDE
by George Eliot
- Adam Bede (end)
“So there is,” said Dinah. “Run, Lisbeth, run to meet Aunt Poyser. Come
in, Adam, and rest; it has been a hard day for thee.”
When saving with a text editor, one can often set the encoding of the file. Remember this encoding for later use. Save as »507-0.A.txt« (never modify an original source).
Standardization of paragraphs
- Adam Bede (excerpt)
- Here some measurement was to be taken which required more concentrated↵
attention, and the sonorous voice subsided into a low whistle; but it↵
presently broke out again with renewed vigour--↵
↵
Let all thy converse be sincere,↵
Thy conscience as the noonday clear.↵
strategy:
- load »507-0.A.txt«
- replace the pattern »\n +« by just »\n« to remove all indentations
- replace »\n\n+« temporarily by »@« to keep paragraph breaks
- replace »\n« by a space » «
- replace »@« by »\n« to create paragraph breaks
- if the text does not end with a »\n«, add one
- write it into a file with a different name »507-0.B.txt« (never modify the result of a previous work step)
main.py
import re
with open( "507-0.A.txt", "r", encoding="UTF-8" ) as f:
    text = f.read()
text = re.sub( r"\n +", r"\n", text )   # remove all indentation
text = re.sub( r"\n\n+", r"@", text )   # temporarily mark paragraph breaks
text = re.sub( r"\n", r" ", text )      # join lines within a paragraph
text = re.sub( r"@", r"\n", text )      # restore paragraph breaks
if text[ -1 ] != '\n':
    text += '\n'
with open( "507-0.B.txt", "w", encoding="UTF-8" ) as f:
    f.write( text )
Show all words of the form “s…n” in context
strategy:
- load »507-0.B.txt«
- replace »\n« by »@ « (marking an end of a paragraph)
- prepend and append 1000 spaces for uniformity
- search for: 30 characters, the pattern s…n, 60 characters, i.e., ».{30}\bs\w*n\b.{60}«
- cut it to 70 characters and print it
We want to see the first 10 results only and prefer straight quotation marks instead of typographic quotation marks.
main.py
from re import finditer, sub
with open( "507-0.B.txt", "r", encoding="UTF-8" ) as f:
    text = f.read()
text = sub( r"\n", r"@ ", text )   # mark each end of a paragraph
text = sub( r"“", r'"', text )
text = sub( r"”", r'"', text )
text = ' ' * 1000 + text + ' ' * 1000
for context in list( finditer( r".{30}\bs\w*n\b.{60}", text ))[ :10 ]:
    print( context.group( 0 )[ :70 ])
- Output
our Lord 1799.@ The afternoon sun was warm on the five workmen there,
Awake, my soul, and with the sun Thy daily stage of duty run; Shake o
help laughing at myself."@ "I shan't loose him till he promises to let
" said Ben, "but I donna mind sayin' as I'll let 't alone at your aski
. Ye'll do finely t' lead the singin'. But I don' know what Parson Irw
d things at Will Maskery's. I shan't be home before going for ten. I'l
Preaching@ About a quarter to seven there was an unusual appearance of
e in which the weather-beaten sign left him as to the heraldic bearing
econcile his dignity with the satisfaction of his curiosity by walking
head: many of 'em goes stark starin' mad wi' their religion. Though t
Further clarification of the treatment of strings like “shan't” or “starin'” might be needed.
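The truncation can be demonstrated directly: the apostrophe is not a word character, so »\b« matches immediately before it (a minimal sketch):

```python
import re

# »'« is not in \w, so \b fires right before the apostrophe:
# contracted and truncated forms match, but without their endings.
print( re.findall( r"\bs\w*n\b", "shan't starin' season" ))
# → ['shan', 'starin', 'season']
```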
(Such a listing of hits with their surrounding context is called a concordance, more precisely a KWIC concordance: keyword in context.)
Show the frequency of characters
main.py
from collections import Counter
print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ))
- Output
Counter({' ': 211049, 'e': 108848, 't': 80842, 'a': 70372, 'o': 69383, 'n': 61990, 'h': 60459, 's': 55755, 'i': 53390, 'r': 50151, 'd': 37827, 'l': 34912, 'u': 25211, 'w': 22128, 'm': 21341, 'g': 19424, 'y': 18810, 'f': 18810, 'c': 17203, ',': 15329, 'b': 13116, 'p': 12160, "'": 9064, '.': 8935, 'k': 8821, 'v': 7362, '-': 4798, 'I': 4527, '{$c65533}': 2719, '{$c65533}': 2689, 'A': 2636, '\n': 2569, 'H': 1762, ';': 1709, 'T': 1616, 'M': 1525, 'S': 1364, 'B': 1266, 'x': 947, 'D': 867, 'W': 790, '?': 772, 'q': 741, 'j': 699, 'P': 659, 'C': 480, 'Y': 430, 'L': 419, ':': 411, 'N': 399, 'G': 394, '!': 375, 'F': 300, 'O': 238, 'E': 229, 'z': 197, 'J': 160, 'R': 150, 'X': 75, 'V': 67, 'U': 47, 'K': 27, '(': 18, ')': 18, 'Q': 12, '1': 5, '0': 5, '7': 4, '9': 4, '3': 3, 'Z': 3, '_': 2, '2': 2, '8': 2, '&': 1, '[': 1, ']': 1})
Show the frequency of the character »g«
main.py
print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'g' ))
print( sum( 1 for _ in filter( lambda c: c == 'g', open( "507-0.B.txt", "r", encoding="UTF-8" ).read())))
from collections import Counter; print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ).get( 'g' ))
- Output
19424
19424
19424
Show the frequency of the ten most frequent word forms
main.py
from re import split
from collections import Counter
print( Counter( split( r'\W+', open( "507-0.B.txt", "r", encoding="UTF-8" ).read() )).most_common( 10 ))
- Output
[('the', 9554), ('to', 6557), ('and', 6425), ('a', 4883), ('of', 4479), ('I', 3362), ('in', 3085), ('was', 3024), ('her', 2997), ('s', 2831)]
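The entry »('s', 2831)« looks odd: splitting on »\W+« cuts at the apostrophe, so contractions and possessives shed an isolated »s« (a small illustration):

```python
from re import split

# \W+ treats the apostrophe as a separator,
# so contracted forms fall apart into pseudo-words.
print( split( r'\W+', "it's Adam's; he shan't" ))
# → ['it', 's', 'Adam', 's', 'he', 'shan', 't']
```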
Show the frequency of a specific substring
main.py
print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'therefore' ))
- Output
2
What about overlapping occurrences?
main.py
print( "dadada".count( "dada" ))
- Output
1
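»str.count« counts non-overlapping occurrences only. Overlapping occurrences can be counted with a zero-width lookahead, which matches without consuming the text (a sketch):

```python
from re import findall

# A lookahead match consumes no characters, so the search
# position advances by only one and overlaps are found.
print( len( findall( r"(?=dada)", "dadada" )))   # → 2
```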
Show the frequency of the most frequent runs of three words
main.py
import pathlib
import re
import collections
import sys

c = collections.Counter()   # total occurrences of each run of three words
f = collections.Counter()   # number of sources containing the run
maxlen = 0

# Default values for glob and x are assumed here for illustration;
# a truthy x selects a tab-separated input format instead of plain text.
def add( path: str, glob: str = '*.txt', x: int = 0 ):
    global maxlen
    path = pathlib.Path( path )
    for file in path.glob( glob ):
        t = collections.Counter()   # runs already counted for this source
        print( '***', file, '***' )
        if not x:
            stream = file.open()
        else:
            stream = file.open( encoding='utf-8' )
        nr = 0
        previous = None
        pprevious = None
        for line in stream:
            nr += 1
            if x:
                if line[ 0 ] != '*':
                    line = ''
                else:
                    line = line.split( chr( 9 ))[ 1 ]
            line = re.sub( r'\(.*?\)', '', line )   # drop parenthesized asides
            line = re.sub( r' +', ' ', line )       # collapse runs of spaces
            line = re.sub( r"' +", "'", line )      # rejoin elisions: »l' ami« → »l'ami«
            for word in re.finditer( r"((?:\w|')+)", line ):
                word = word.group( 0 )
                if word[ 0 ] not in '0123456789':
                    word = word.lower()
                if previous and pprevious:
                    comb = pprevious + ' ' + previous + ' ' + word
                    c[ comb ] += 1
                    if len( comb ) > maxlen: maxlen = len( comb )
                    if comb not in t:   # count each source at most once
                        t[ comb ] += 1
                        f[ comb ] += 1
                pprevious = previous
                previous = word
                if 'ã' in word:   # guard against mojibake in the input
                    print( '???', file, nr, '-', line, '---', word )
                    sys.exit( 0 )

add( '''text''' )
add( '''text''' )
for w, _ in f.most_common( 10000 ):
    print( f'{w:{maxlen+1}} {c[ w ]:6} {f[ w ]:4}' )
print()
The output is sorted by the number of different sources containing a triplet of words, so that many repetitions of a triplet in a single source do not have a large effect on its position in the output.
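The bookkeeping with two counters can be illustrated on a toy corpus (hypothetical data): one counter accumulates every occurrence, the other is incremented at most once per source.

```python
from collections import Counter

sources = [ "la la la top", "top la" ]   # two hypothetical sources
c = Counter()   # total occurrences
f = Counter()   # number of sources containing the word
for source in sources:
    seen = set()
    for word in source.split():
        c[ word ] += 1
        if word not in seen:   # first time in this source only
            seen.add( word )
            f[ word ] += 1
print( c[ 'la' ], f[ 'la' ] )   # → 4 2
```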
- The first 74 lines of the output of this algorithm for my French corpus (number of occurrences, number of sources containing it)
ce que tu 6423 785
il y a 13954 774
qu'est ce que 7335 773
est ce que 8146 712
tout le monde 4030 684
ce que je 3839 679
y a un 2522 618
je ne sais 3398 579
ce n'est pas 3244 576
ce que c'est 2247 561
ce que vous 3405 559
ne sais pas 2666 555
que tu as 1789 547
y a une 1831 543
en train de 1775 515
je ne suis 2484 507
qu'il y a 1622 506
que tu veux 1439 499
y a des 2290 495
je ne peux 1845 492
je me suis 2621 491
il n'y a 1987 490
qu'est ce qui 1573 484
y en a 2221 482
ne peux pas 1835 482
tout de suite 1704 475
ne suis pas 1962 474
que je suis 1815 469
je crois que 1408 458
ce qui est 2312 452
tout ce que 1484 451
que tu fais 1145 450
ne peut pas 2626 450
un peu de 1217 448
je pense que 1634 447
que je ne 2084 447
de ne pas 1685 444
parce que je 1118 441
à la maison 1424 440
je n'ai pas 1892 440
qu'est ce qu'il 1198 439
ce que ça 812 433
a pas de 1604 426
je sais pas 1375 423
ne veux pas 1300 417
je ne veux 1401 412
que vous avez 2085 411
ce que j'ai 892 411
pas du tout 1220 407
peut être que 837 404
je sais que 1005 404
ce qui se 1185 401
je suis un 989 397
il faut que 1608 390
n'y a pas 1026 387
ne va pas 882 386
ne sont pas 2184 384
à cause de 1233 383
je vais te 843 382
la première fois 1007 382
tout à fait 2053 379
tout le temps 761 375
a t il 1817 374
que ce soit 1290 374
on ne peut 1910 370
c'est pour ça 680 370
ce qu'il y 830 361
si tu veux 900 354
il a dit 1233 353
qui se passe 909 351
je vous ai 852 348
un peu plus 1052 347
pas besoin de 659 347
Exercises
Extending the algorithm
Extend the algorithm so that it will search for phrases with up to 30 words.
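One possible approach (a sketch, not a full solution): replace the chain of »previous« variables by a sliding window, e.g. with »collections.deque«; the run length then becomes a parameter.

```python
from collections import Counter, deque

def count_runs( words, n ):
    """Count runs of n consecutive words with a sliding window."""
    counts = Counter()
    window = deque( maxlen=n )   # drops the oldest word automatically
    for word in words:
        window.append( word )
        if len( window ) == n:
            counts[ ' '.join( window )] += 1
    return counts

r = count_runs( "a b a b a".split(), 2 )
print( r[ 'a b' ], r[ 'b a' ] )   # → 2 2
```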
Exercise
- The first 74 lines of the output of such an algorithm for my Japanese Corpus
"はい" ×67739 ↔2 – HA I
"あっ" ×51279 ↔2 – A SMALL-TU
"うん" ×38420 ↔2 – U N
"えっ" ×37268 ↔2 – E SMALL-TU
"いや" ×32336 ↔2 – I YA
"ああ" ×32232 ↔2 – A A
"でも" ×29729 ↔2 – DE MO
"もう" ×23096 ↔2 – MO U
"あ" ×22929 ↔1 – A
"じゃあ" ×20123 ↔3 – ZI SMALL-YA A
"ちょっと" ×19575 ↔4 – TI SMALL-YO SMALL-TU TO
"あの" ×18596 ↔2 – A NO
"え" ×16755 ↔1 – E
"ん" ×16391 ↔1 – N
"これ" ×16078 ↔2 – KO RE
"何" ×15591 ↔1 – X-4F55
"私" ×14642 ↔1 – X-79C1
"だから" ×14125 ↔3 – DA KA RA
"おい" ×13855 ↔2 – O I
"まあ" ×13307 ↔2 – MA A
"今" ×12493 ↔1 – X-4ECA
"すいません" ×11940 ↔5 – SU I MA SE N
"私は" ×11247 ↔2 – X-79C1 HA
"ええ" ×11237 ↔2 – E E
"何か" ×10606 ↔2 – X-4F55 KA
"お前" ×10569 ↔2 – O X-524D
"それは" ×10475 ↔3 – SO RE HA
"ありがとう" ×10469 ↔5 – A RI GA TO U
"そう" ×9965 ↔2 – SO U
"ねえ" ×9752 ↔2 – NE E
"あれ" ×9180 ↔2 – A RE
"どうぞ" ×8971 ↔3 – DO U ZO
"ほら" ×8948 ↔2 – HO RA
"何で" ×8822 ↔2 – X-4F55 DE
"ありがとうございます" ×8571 ↔10 – A RI GA TO U GO ZA I MA SU
"あぁ" ×8396 ↔2 – A SMALL-A
"お願いします" ×7552 ↔6 – O X-9858 I SI MA SU
"俺" ×7515 ↔1 – X-4FFA
"また" ×7467 ↔2 – MA TA
"で" ×7283 ↔1 – DE
"よし" ×7214 ↔2 – YO SI
"これは" ×7143 ↔3 – KO RE HA
"いえ" ×7052 ↔2 – I E
"はぁ" ×7008 ↔2 – HA SMALL-A
"どうして" ×6943 ↔4 – DO U SI TE
"まだ" ×6851 ↔2 – MA DA
"大丈夫" ×6838 ↔3 – X-5927 X-4E08 X-592B
"それ" ×6710 ↔2 – SO RE
"俺は" ×6630 ↔2 – X-4FFA HA
"って" ×6593 ↔2 – SMALL-TU TE
"一同" ×6477 ↔2 – X-4E00 X-540C
"だって" ×6413 ↔3 – DA SMALL-TU TE
"ごめん" ×6407 ↔3 – GO ME N
"やっぱり" ×6314 ↔4 – YA SMALL-TU PA RI
"みんな" ×6033 ↔3 – MI N NA
"先生" ×5932 ↔2 – X-5148 X-751F
"えッ" ×5932 ↔2 – E SMALL-TU
"しかし" ×5825 ↔3 – SI KA SI
"い" ×5813 ↔1 – I
"それで" ×5798 ↔3 – SO RE DE
"そして" ×5790 ↔3 – SO SI TE
"う" ×5629 ↔1 – U
"はあ" ×5624 ↔2 – HA A
"どうも" ×5535 ↔3 – DO U MO
"は" ×5511 ↔1 – HA
"では" ×5275 ↔2 – DE HA
"あッ" ×5256 ↔2 – A SMALL-TU
"さあ" ×5123 ↔2 – SA A
"ごめんなさい" ×4997 ↔6 – GO ME N NA SA I
"今日は" ×4907 ↔3 – X-4ECA X-65E5 HA
"そんな" ×4819 ↔3 – SO N NA
"はっ" ×4805 ↔2 – HA SMALL-TU
"それが" ×4802 ↔3 – SO RE GA
"すみません" ×4755 ↔5 – SU MI MA SE N