Applications of Python to linguistics
An example text file
“Adam Bede” was written in 1859. The file (1179 KB) can be downloaded from »https://www.gutenberg.org/files/507/507-0.txt« and saved as »507-0.txt« (in the same folder as the Python script).
Scrubbing
In “Data Jujitsu,” DJ Patil states that “80% of the work in any data project is in cleaning the data” (2012). (quoted from “Data Science at the Command Line” by Jeroen Janssens [2014-09-23])
The file »507-0.txt« contains some lines at the beginning and at the end that are not part of “Adam Bede” and should be removed manually; automatic identification and removal of front and back matter is beyond the scope of this course. Afterwards the file should look as follows.
- Adam Bede (beginning)
ADAM BEDE
by George Eliot
- Adam Bede (end)
“So there is,” said Dinah. “Run, Lisbeth, run to meet Aunt Poyser. Come
in, Adam, and rest; it has been a hard day for thee.”
When saving with a text editor, one can often set the encoding of the file. Remember this encoding for later use. Save as »507-0.A.txt« (never modify an original source).
Standardization of paragraphs
- Adam Bede (excerpt)
- Here some measurement was to be taken which required more concentrated↵
attention, and the sonorous voice subsided into a low whistle; but it↵
presently broke out again with renewed vigour--↵
↵
Let all thy converse be sincere,↵
Thy conscience as the noonday clear.↵
strategy:
- load »507-0.A.txt«
- replace the pattern »\n +« by just »\n« to remove all indentations
- replace »\n\n+« temporarily by »@« to keep paragraph breaks
- replace »\n« by a space » «
- replace »@« by »\n« to create paragraph breaks
- if the text does not end with a »\n«, add one
- write it into a file with a different name »507-0.B.txt« (never modify the result of a previous work step)
main.py
import re
with open( "507-0.A.txt", "r", encoding="UTF-8" ) as f:
    text = f.read()
text = re.sub( r"\n +", r"\n", text )   # remove all indentation
text = re.sub( r"\n\n+", r"@", text )   # temporarily mark paragraph breaks
text = re.sub( r"\n", r" ", text )      # join lines within a paragraph
text = re.sub( r"@", r"\n", text )      # restore paragraph breaks
if text[ -1 ] != '\n':
    text += '\n'
with open( "507-0.B.txt", "w", encoding="UTF-8" ) as f:
    f.write( text )
Show all words of the form “s…n” in context
strategy:
- load »507-0.B.txt«
- replace »\n« by »@ « (marking an end of a paragraph)
- prepend and append 1000 spaces for uniformity
- search for: 30 characters, the pattern s…n, 60 characters, i.e., ».{30}\bs\w*n\b.{60}«
- cut it to 70 characters and print it
We want to see the first 10 results only and prefer straight quotation marks instead of typographic quotation marks.
main.py
from re import finditer, sub
with open( "507-0.B.txt", "r", encoding="UTF-8" ) as f:
    text = f.read()
text = sub( r"\n", r"@ ", text )   # mark each end of a paragraph
text = sub( r"“", r'"', text )
text = sub( r"”", r'"', text )
text = ' ' * 1000 + text + ' ' * 1000
for context in list( finditer( r".{30}\bs\w*n\b.{60}", text ))[ :10 ]:
    print( context.group( 0 )[ :70 ])
- Output
our Lord 1799.@ The afternoon sun was warm on the five workmen there,
Awake, my soul, and with the sun Thy daily stage of duty run; Shake o
help laughing at myself."@ "I shan't loose him till he promises to let
" said Ben, "but I donna mind sayin' as I'll let 't alone at your aski
. Ye'll do finely t' lead the singin'. But I don' know what Parson Irw
d things at Will Maskery's. I shan't be home before going for ten. I'l
Preaching@ About a quarter to seven there was an unusual appearance of
e in which the weather-beaten sign left him as to the heraldic bearing
econcile his dignity with the satisfaction of his curiosity by walking
head: many of 'em goes stark starin' mad wi' their religion. Though t
Further clarification of the treatment of strings like “shan't” or “starin'” might be needed.
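The truncation can be demonstrated directly: the apostrophe is not a word character, so »\b« matches immediately before it (a minimal sketch):

```python
import re

# »'« is not in \w, so \b fires right before the apostrophe:
# contracted and truncated forms match, but without their endings.
print( re.findall( r"\bs\w*n\b", "shan't starin' season" ))
# → ['shan', 'starin', 'season']
```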
(Such a listing of hits with their surrounding context is called a concordance, more precisely a KWIC concordance: keyword in context.)
Show the frequency of characters
main.py
from collections import Counter
print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ))
- Output
Counter({' ': 211049, 'e': 108848, 't': 80842, 'a': 70372, 'o': 69383, 'n': 61990, 'h': 60459, 's': 55755, 'i': 53390, 'r': 50151, 'd': 37827, 'l': 34912, 'u': 25211, 'w': 22128, 'm': 21341, 'g': 19424, 'y': 18810, 'f': 18810, 'c': 17203, ',': 15329, 'b': 13116, 'p': 12160, "'": 9064, '.': 8935, 'k': 8821, 'v': 7362, '-': 4798, 'I': 4527, '{$c65533}': 2719, '{$c65533}': 2689, 'A': 2636, '\n': 2569, 'H': 1762, ';': 1709, 'T': 1616, 'M': 1525, 'S': 1364, 'B': 1266, 'x': 947, 'D': 867, 'W': 790, '?': 772, 'q': 741, 'j': 699, 'P': 659, 'C': 480, 'Y': 430, 'L': 419, ':': 411, 'N': 399, 'G': 394, '!': 375, 'F': 300, 'O': 238, 'E': 229, 'z': 197, 'J': 160, 'R': 150, 'X': 75, 'V': 67, 'U': 47, 'K': 27, '(': 18, ')': 18, 'Q': 12, '1': 5, '0': 5, '7': 4, '9': 4, '3': 3, 'Z': 3, '_': 2, '2': 2, '8': 2, '&': 1, '[': 1, ']': 1})
Show the frequency of the character »g«
main.py
print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'g' ))
print( sum( 1 for _ in filter( lambda c: c == 'g', open( "507-0.B.txt", "r", encoding="UTF-8" ).read())))
from collections import Counter; print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ).get( 'g' ))
- Output
19424
19424
19424
Show the frequency of the ten most frequent word forms
main.py
from re import split
from collections import Counter
print( Counter( split( r'\W+', open( "507-0.B.txt", "r", encoding="UTF-8" ).read() )).most_common( 10 ))
- Output
[('the', 9554), ('to', 6557), ('and', 6425), ('a', 4883), ('of', 4479), ('I', 3362), ('in', 3085), ('was', 3024), ('her', 2997), ('s', 2831)]
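The entry »('s', 2831)« looks odd: splitting on »\W+« cuts at the apostrophe, so contractions and possessives shed an isolated »s« (a small illustration):

```python
from re import split

# \W+ treats the apostrophe as a separator,
# so contracted forms fall apart into pseudo-words.
print( split( r'\W+', "it's Adam's; he shan't" ))
# → ['it', 's', 'Adam', 's', 'he', 'shan', 't']
```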
Show the frequency of a specific substring
main.py
print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'therefore' ))
- Output
2
What about overlapping occurrences?
main.py
print( "dadada".count( "dada" ))
- Output
1
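»str.count« counts non-overlapping occurrences only. Overlapping occurrences can be counted with a zero-width lookahead, which matches without consuming the text (a sketch):

```python
from re import findall

# A lookahead match consumes no characters, so the search
# position advances by only one and overlaps are found.
print( len( findall( r"(?=dada)", "dadada" )))   # → 2
```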
Show the frequency of the most frequent runs of three words
main.py
import pathlib
import re
import collections
import sys

c = collections.Counter()   # total occurrences of each run of three words
f = collections.Counter()   # number of sources containing the run
maxlen = 0

# Default values for glob and x are assumed here for illustration;
# a truthy x selects a tab-separated input format instead of plain text.
def add( path: str, glob: str = '*.txt', x: int = 0 ):
    global maxlen
    path = pathlib.Path( path )
    for file in path.glob( glob ):
        t = collections.Counter()   # runs already counted for this source
        print( '***', file, '***' )
        if not x:
            stream = file.open()
        else:
            stream = file.open( encoding='utf-8' )
        nr = 0
        previous = None
        pprevious = None
        for line in stream:
            nr += 1
            if x:
                if line[ 0 ] != '*':
                    line = ''
                else:
                    line = line.split( chr( 9 ))[ 1 ]
            line = re.sub( r'\(.*?\)', '', line )   # drop parenthesized asides
            line = re.sub( r' +', ' ', line )       # collapse runs of spaces
            line = re.sub( r"' +", "'", line )      # rejoin elisions: »l' ami« → »l'ami«
            for word in re.finditer( r"((?:\w|')+)", line ):
                word = word.group( 0 )
                if word[ 0 ] not in '0123456789':
                    word = word.lower()
                if previous and pprevious:
                    comb = pprevious + ' ' + previous + ' ' + word
                    c[ comb ] += 1
                    if len( comb ) > maxlen: maxlen = len( comb )
                    if comb not in t:   # count each source at most once
                        t[ comb ] += 1
                        f[ comb ] += 1
                pprevious = previous
                previous = word
                if 'ã' in word:   # guard against mojibake in the input
                    print( '???', file, nr, '-', line, '---', word )
                    sys.exit( 0 )

add( '''text''' )
add( '''text''' )
for w, _ in f.most_common( 10000 ):
    print( f'{w:{maxlen+1}} {c[ w ]:6} {f[ w ]:4}' )
print()
The output is sorted by the number of different sources containing a triplet of words, so that many repetitions of a triplet in a single source do not have a large effect on its position in the output.
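The bookkeeping with two counters can be illustrated on a toy corpus (hypothetical data): one counter accumulates every occurrence, the other is incremented at most once per source.

```python
from collections import Counter

sources = [ "la la la top", "top la" ]   # two hypothetical sources
c = Counter()   # total occurrences
f = Counter()   # number of sources containing the word
for source in sources:
    seen = set()
    for word in source.split():
        c[ word ] += 1
        if word not in seen:   # first time in this source only
            seen.add( word )
            f[ word ] += 1
print( c[ 'la' ], f[ 'la' ] )   # → 4 2
```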
- The first 74 lines of the output of this algorithm for my French corpus (number of occurrences, number of sources containing it)
ce que tu 6423 785
il y a 13954 774
qu'est ce que 7335 773
est ce que 8146 712
tout le monde 4030 684
ce que je 3839 679
y a un 2522 618
je ne sais 3398 579
ce n'est pas 3244 576
ce que c'est 2247 561
ce que vous 3405 559
ne sais pas 2666 555
que tu as 1789 547
y a une 1831 543
en train de 1775 515
je ne suis 2484 507
qu'il y a 1622 506
que tu veux 1439 499
y a des 2290 495
je ne peux 1845 492
je me suis 2621 491
il n'y a 1987 490
qu'est ce qui 1573 484
y en a 2221 482
ne peux pas 1835 482
tout de suite 1704 475
ne suis pas 1962 474
que je suis 1815 469
je crois que 1408 458
ce qui est 2312 452
tout ce que 1484 451
que tu fais 1145 450
ne peut pas 2626 450
un peu de 1217 448
je pense que 1634 447
que je ne 2084 447
de ne pas 1685 444
parce que je 1118 441
à la maison 1424 440
je n'ai pas 1892 440
qu'est ce qu'il 1198 439
ce que ça 812 433
a pas de 1604 426
je sais pas 1375 423
ne veux pas 1300 417
je ne veux 1401 412
que vous avez 2085 411
ce que j'ai 892 411
pas du tout 1220 407
peut être que 837 404
je sais que 1005 404
ce qui se 1185 401
je suis un 989 397
il faut que 1608 390
n'y a pas 1026 387
ne va pas 882 386
ne sont pas 2184 384
à cause de 1233 383
je vais te 843 382
la première fois 1007 382
tout à fait 2053 379
tout le temps 761 375
a t il 1817 374
que ce soit 1290 374
on ne peut 1910 370
c'est pour ça 680 370
ce qu'il y 830 361
si tu veux 900 354
il a dit 1233 353
qui se passe 909 351
je vous ai 852 348
un peu plus 1052 347
pas besoin de 659 347
Exercises
Extending the algorithm
Extend the algorithm so that it will search for phrases with up to 30 words.
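One possible approach (a sketch, not a full solution): replace the chain of »previous« variables by a sliding window, e.g. with »collections.deque«; the run length then becomes a parameter.

```python
from collections import Counter, deque

def count_runs( words, n ):
    """Count runs of n consecutive words with a sliding window."""
    counts = Counter()
    window = deque( maxlen=n )   # drops the oldest word automatically
    for word in words:
        window.append( word )
        if len( window ) == n:
            counts[ ' '.join( window )] += 1
    return counts

r = count_runs( "a b a b a".split(), 2 )
print( r[ 'a b' ], r[ 'b a' ] )   # → 2 2
```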
Exercise
- The first 74 lines of the output of such an algorithm for my Japanese Corpus
"はい" ×67739 ↔2 – HA I
"あっ" ×51279 ↔2 – A SMALL-TU
"うん" ×38420 ↔2 – U N
"えっ" ×37268 ↔2 – E SMALL-TU
"いや" ×32336 ↔2 – I YA
"ああ" ×32232 ↔2 – A A
"でも" ×29729 ↔2 – DE MO
"もう" ×23096 ↔2 – MO U
"あ" ×22929 ↔1 – A
"じゃあ" ×20123 ↔3 – ZI SMALL-YA A
"ちょっと" ×19575 ↔4 – TI SMALL-YO SMALL-TU TO
"あの" ×18596 ↔2 – A NO
"え" ×16755 ↔1 – E
"ん" ×16391 ↔1 – N
"これ" ×16078 ↔2 – KO RE
"何" ×15591 ↔1 – X-4F55
"私" ×14642 ↔1 – X-79C1
"だから" ×14125 ↔3 – DA KA RA
"おい" ×13855 ↔2 – O I
"まあ" ×13307 ↔2 – MA A
"今" ×12493 ↔1 – X-4ECA
"すいません" ×11940 ↔5 – SU I MA SE N
"私は" ×11247 ↔2 – X-79C1 HA
"ええ" ×11237 ↔2 – E E
"何か" ×10606 ↔2 – X-4F55 KA
"お前" ×10569 ↔2 – O X-524D
"それは" ×10475 ↔3 – SO RE HA
"ありがとう" ×10469 ↔5 – A RI GA TO U
"そう" ×9965 ↔2 – SO U
"ねえ" ×9752 ↔2 – NE E
"あれ" ×9180 ↔2 – A RE
"どうぞ" ×8971 ↔3 – DO U ZO
"ほら" ×8948 ↔2 – HO RA
"何で" ×8822 ↔2 – X-4F55 DE
"ありがとうございます" ×8571 ↔10 – A RI GA TO U GO ZA I MA SU
"あぁ" ×8396 ↔2 – A SMALL-A
"お願いします" ×7552 ↔6 – O X-9858 I SI MA SU
"俺" ×7515 ↔1 – X-4FFA
"また" ×7467 ↔2 – MA TA
"で" ×7283 ↔1 – DE
"よし" ×7214 ↔2 – YO SI
"これは" ×7143 ↔3 – KO RE HA
"いえ" ×7052 ↔2 – I E
"はぁ" ×7008 ↔2 – HA SMALL-A
"どうして" ×6943 ↔4 – DO U SI TE
"まだ" ×6851 ↔2 – MA DA
"大丈夫" ×6838 ↔3 – X-5927 X-4E08 X-592B
"それ" ×6710 ↔2 – SO RE
"俺は" ×6630 ↔2 – X-4FFA HA
"って" ×6593 ↔2 – SMALL-TU TE
"一同" ×6477 ↔2 – X-4E00 X-540C
"だって" ×6413 ↔3 – DA SMALL-TU TE
"ごめん" ×6407 ↔3 – GO ME N
"やっぱり" ×6314 ↔4 – YA SMALL-TU PA RI
"みんな" ×6033 ↔3 – MI N NA
"先生" ×5932 ↔2 – X-5148 X-751F
"えッ" ×5932 ↔2 – E SMALL-TU
"しかし" ×5825 ↔3 – SI KA SI
"い" ×5813 ↔1 – I
"それで" ×5798 ↔3 – SO RE DE
"そして" ×5790 ↔3 – SO SI TE
"う" ×5629 ↔1 – U
"はあ" ×5624 ↔2 – HA A
"どうも" ×5535 ↔3 – DO U MO
"は" ×5511 ↔1 – HA
"では" ×5275 ↔2 – DE HA
"あッ" ×5256 ↔2 – A SMALL-TU
"さあ" ×5123 ↔2 – SA A
"ごめんなさい" ×4997 ↔6 – GO ME N NA SA I
"今日は" ×4907 ↔3 – X-4ECA X-65E5 HA
"そんな" ×4819 ↔3 – SO N NA
"はっ" ×4805 ↔2 – HA SMALL-TU
"それが" ×4802 ↔3 – SO RE GA
"すみません" ×4755 ↔5 – SU MI MA SE N