Applications of Python to linguistics, lesson, page 724594
https://www.purl.org/stefan_ram/pub/applications_python (permalink) is the canonical URI of this page.
Stefan Ram
Python Course

Applications of Python to linguistics

An example text file

“Adam Bede” was written in 1859. The file (1179 KB) can be downloaded from »https://www.gutenberg.org/files/507/507-0.txt« and saved as »507-0.txt« (into the same folder as the Python script).

Scrubbing

In “Data Jujitsu,” DJ Patil states that “80% of the work in any data project is in cleaning the data” (2012). (Quoted after “Data Science at the Command Line” by Jeroen Janssens [2014-09-23].)

The file »507-0.txt« contains some lines at the beginning and at the end that are not part of “Adam Bede” and should be removed manually. Automatic identification and removal of front and back matter is beyond the scope of this course. After the removal, the file should begin and end as follows.

Adam Bede (beginning)
ADAM BEDE

by George Eliot
Adam Bede (end)
“So there is,” said Dinah. “Run, Lisbeth, run to meet Aunt Poyser. Come
in, Adam, and rest; it has been a hard day for thee.”

When saving with a text editor, one can often choose the encoding of the file. Remember this encoding for later use. Save as »507-0.A.txt« (never modify an original source).

Standardization of paragraphs

Adam Bede (excerpt)
Here some measurement was to be taken which required more concentrated
attention, and the sonorous voice subsided into a low whistle; but it
presently broke out again with renewed vigour--

     Let all thy converse be sincere,
     Thy conscience as the noonday clear.

strategy:

Indentation at the beginnings of lines is removed, paragraph breaks (blank lines) are temporarily marked with »@«, the remaining line breaks are replaced with spaces so that each paragraph becomes a single line, and finally the marks are turned back into line breaks.

main.py

import re

with open( "507-0.A.txt", "r", encoding="UTF-8" )as f:
text = f.read()

text = re.sub( r"\n +", r"\n", text )
text = re.sub( r"\n\n+", r"@", text )
text = re.sub( r"\n", r" ", text )
text = re.sub( r"@", r"\n", text )
if text[ -1 ]!= '\n': text += '\n'

with open( "507-0.B.txt", "w", encoding="UTF-8" )as f:
f.write( text )
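The effect of this pipeline can be checked on a small sample (the string below is a made-up miniature, not from the novel):

```python
import re

# an illustrative miniature: one wrapped paragraph, then an indented verse
sample = "First line\nsecond line.\n\n     Indented verse,\n     more verse.\n"

sample = re.sub( r"\n +", r"\n", sample )   # remove indentation after a line break
sample = re.sub( r"\n\n+", r"@", sample )   # mark paragraph breaks with »@«
sample = re.sub( r"\n", r" ", sample )      # join the lines of each paragraph
sample = re.sub( r"@", r"\n", sample )      # one paragraph per line
if sample[ -1 ] != '\n': sample += '\n'

print( sample )
```

The wrapped paragraph becomes the single line »First line second line.«, and the indented verse becomes a second line.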

Show all words of the form “s…n” in context

strategy:

We ₍ˈwʌnə₎ want to see only the first 10 results and prefer straight quotation marks instead of typographic quotation marks.

main.py

from re import finditer, sub

with open( "507-0.B.txt", "r", encoding="UTF-8" )as f:
text = f.read()

text = sub( r"\n", r"@ ", text )
text = sub( r"“", r'"', text )
text = sub( r"”", r'"', text )
text = ' ' * 1000 + text + ' ' * 1000

for context in list( finditer( r".{30}\bs\w*n\b.{60}", text ))[ :10 ]:
    print( context.group( 0 )[ :70 ])

Transcript
our Lord 1799.@ The afternoon sun was warm on the five workmen there, 
Awake, my soul, and with the sun Thy daily stage of duty run; Shake o
help laughing at myself."@ "I shan't loose him till he promises to let
" said Ben, "but I donna mind sayin' as I'll let 't alone at your aski
. Ye'll do finely t' lead the singin'. But I don' know what Parson Irw
d things at Will Maskery's. I shan't be home before going for ten. I'l
Preaching@ About a quarter to seven there was an unusual appearance of
e in which the weather-beaten sign left him as to the heraldic bearing
econcile his dignity with the satisfaction of his curiosity by walking
head: many of 'em goes stark starin' mad wi' their religion. Though t

Further clarification of the treatment of strings like “shan't” or “starin'” might be needed.

(A display of search hits together with their surrounding context like this one is called a “concordance” or “keyword in context” [KWIC].)

Show the frequency of characters

main.py

from collections import Counter

print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ))

Transcript
Counter({' ': 211049, 'e': 108848, 't': 80842, 'a': 70372, 'o': 69383, 'n': 61990, 'h': 60459, 's': 55755, 'i': 53390, 'r': 50151, 'd': 37827, 'l': 34912, 'u': 25211, 'w': 22128, 'm': 21341, 'g': 19424, 'y': 18810, 'f': 18810, 'c': 17203, ',': 15329, 'b': 13116, 'p': 12160, "'": 9064, '.': 8935, 'k': 8821, 'v': 7362, '-': 4798, 'I': 4527, '{$c65533}': 2719, '{$c65533}': 2689, 'A': 2636, '\n': 2569, 'H': 1762, ';': 1709, 'T': 1616, 'M': 1525, 'S': 1364, 'B': 1266, 'x': 947, 'D': 867, 'W': 790, '?': 772, 'q': 741, 'j': 699, 'P': 659, 'C': 480, 'Y': 430, 'L': 419, ':': 411, 'N': 399, 'G': 394, '!': 375, 'F': 300, 'O': 238, 'E': 229, 'z': 197, 'J': 160, 'R': 150, 'X': 75, 'V': 67, 'U': 47, 'K': 27, '(': 18, ')': 18, 'Q': 12, '1': 5, '0': 5, '7': 4, '9': 4, '3': 3, 'Z': 3, '_': 2, '2': 2, '8': 2, '&': 1, '[': 1, ']': 1})
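The raw counter also includes spaces, punctuation, and two garbled characters (presumably the typographic quotation marks). If only letters are of interest, one can filter and case-fold before counting; a small sketch on a made-up string:

```python
from collections import Counter

text = "Adam Bede was written by George Eliot."

# count letters only, case-folded; spaces and punctuation are skipped
letters = Counter( c for c in text.lower() if c.isalpha() )
print( letters.most_common( 3 ))
```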

Show the frequency of the character »g«

main.py

print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'g' ))

print( sum( 1 for _ in filter( lambda c: c == 'g', open( "507-0.B.txt", "r", encoding="UTF-8" ).read())))

from collections import Counter; print( Counter( open( "507-0.B.txt", "r", encoding="UTF-8" ).read() ).get( 'g' ))

Transcript
19424
19424
19424

Show the frequency of the ten most frequent word forms

main.py

from re import split

from collections import Counter

print( Counter( split( r'\W+', open( "507-0.B.txt", "r", encoding="UTF-8" ).read() )).most_common( 10 ))

Transcript
[('the', 9554), ('to', 6557), ('and', 6425), ('a', 4883), ('of', 4479), ('I', 3362), ('in', 3085), ('was', 3024), ('her', 2997), ('s', 2831)]
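The entry »('s', 2831)« appears because splitting on »\W+« breaks contractions and possessives like “it's” into “it” and “s”. A sketch of an apostrophe-aware tokenization with »findall« (on an illustrative string, not the novel):

```python
from collections import Counter
from re import findall

text = "It's a fine day, isn't it? It's fine."

# »\w+« alone breaks contractions apart: "It's" becomes "It" and "s"
broken = Counter( findall( r"\w+", text ))

# an apostrophe-aware pattern keeps contractions whole
whole = Counter( findall( r"\w+(?:'\w+)*", text ))

print( broken.most_common( 3 ))
print( whole.most_common( 3 ))
```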

Show the frequency of a specific substring

main.py
print( open( "507-0.B.txt", "r", encoding="UTF-8" ).read().count( 'therefore' ))
Transcript
2
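Note that »count« counts substrings, so counting, for example, “the” would also match inside “there” or “weather”. A word-boundary regex restricts the count to whole words (illustrative string):

```python
from re import findall

text = "the weather there is other than the theme"

print( text.count( "the" ))                  # substring count: also matches inside "weather", "there", ...
print( len( findall( r"\bthe\b", text )))    # whole words only
```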

What about overlapping occurrences?

main.py
print( "dadada".count( "dada" ))
Transcript
1
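»str.count« finds non-overlapping occurrences only. Overlapping occurrences can be counted with a zero-width lookahead, which matches at every starting position without consuming text:

```python
import re

print( "dadada".count( "dada" ))                   # non-overlapping: 1
print( len( re.findall( r"(?=dada)", "dadada" )))  # overlapping, via lookahead: 2
```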

Show the frequency of the most frequent runs of three words

main.py

import pathlib
import re
import collections
import sys

c = collections.Counter()   # total number of occurrences of each run of words
f = collections.Counter()   # number of sources that contain the run
maxlen = 0                  # length of the longest run seen (for output alignment)

def add( path: str, glob: str, x: int ):
    global maxlen
    path = pathlib.Path( path )
    files = path.glob( glob )
    for file in files:
        t = collections.Counter()   # runs seen so far in this file
        print( '***', file, '***' )
        if not x:
            stream = pathlib.Path.open( file )
        else:
            stream = pathlib.Path.open( file, encoding='utf-8' )
        nr = 0
        # with four »previous« variables, »comb« spans five words;
        # for runs of three words, only »previous« and »pprevious« are needed
        previous = None
        pprevious = None
        p2previous = None
        p3previous = None
        for line in stream:
            nr += 1
            if x:
                # in this mode, only lines starting with »*« carry text,
                # in the second tab-separated field
                if line[ 0 ] != '*':
                    line = ''
                else:
                    line = line.split( chr( 9 ))[ 1 ]
            line = re.sub( r'\(.*?\)', '', line )   # remove parenthesized annotations
            line = re.sub( r' +', ' ', line )       # collapse runs of spaces
            line = re.sub( r"' +", "'", line )      # remove spaces after apostrophes
            for word in re.finditer( r"((?:\w|')+)", line ):
                word = word.group( 0 )
                if word[ 0 ] not in '0123456789':
                    word = word.lower()
                    if previous and pprevious and p2previous and p3previous:
                        comb = p3previous + ' ' + p2previous + ' ' \
                            + pprevious + ' ' + previous + ' ' + word
                        c[ comb ] += 1
                        if len( comb ) > maxlen: maxlen = len( comb )
                        if comb not in t:
                            t[ comb ] += 1
                            f[ comb ] += 1
                    p3previous = p2previous
                    p2previous = pprevious
                    pprevious = previous
                    previous = word
                    if 'ã' in word:   # encoding sanity check: stop on mojibake
                        print( '???', file, nr, '-', line, '---', word )
                        sys.exit( 0 )

# the actual corpus paths, glob patterns, and modes are elided here
add( '''text''' )
add( '''text''' )

for w, _ in f.most_common( 10000 ):
    print( f'{w:{maxlen+1}} {c[ w ]:6} {f[ w ]:4}' )
print()

The output is sorted by the number of different sources that contain a run of words, so that many repetitions of a run within a single source do not have a large effect on the run's position in the output.

The first lines of the output of this algorithm for my French corpus (number of occurrences, number of sources containing the run)
ce que tu                 6423  785
il y a 13954 774
qu'est ce que 7335 773
est ce que 8146 712
tout le monde 4030 684
ce que je 3839 679
y a un 2522 618
je ne sais 3398 579
ce n'est pas 3244 576
ce que c'est 2247 561
ce que vous 3405 559
ne sais pas 2666 555
que tu as 1789 547
y a une 1831 543
en train de 1775 515
je ne suis 2484 507
qu'il y a 1622 506
que tu veux 1439 499
y a des 2290 495
je ne peux 1845 492
je me suis 2621 491
il n'y a 1987 490
qu'est ce qui 1573 484
y en a 2221 482
ne peux pas 1835 482
tout de suite 1704 475
ne suis pas 1962 474
que je suis 1815 469
je crois que 1408 458
ce qui est 2312 452
tout ce que 1484 451
que tu fais 1145 450
ne peut pas 2626 450
un peu de 1217 448
je pense que 1634 447
que je ne 2084 447
de ne pas 1685 444
parce que je 1118 441
à la maison 1424 440
je n'ai pas 1892 440
qu'est ce qu'il 1198 439
ce que ça 812 433
a pas de 1604 426
je sais pas 1375 423
ne veux pas 1300 417
je ne veux 1401 412
que vous avez 2085 411
ce que j'ai 892 411
pas du tout 1220 407
peut être que 837 404
je sais que 1005 404
ce qui se 1185 401
je suis un 989 397
il faut que 1608 390
n'y a pas 1026 387
ne va pas 882 386
ne sont pas 2184 384
à cause de 1233 383
je vais te 843 382
la première fois 1007 382
tout à fait 2053 379
tout le temps 761 375
a t il 1817 374
que ce soit 1290 374
on ne peut 1910 370
c'est pour ça 680 370
ce qu'il y 830 361
si tu veux 900 354
il a dit 1233 353
qui se passe 909 351
je vous ai 852 348
un peu plus 1052 347
pas besoin de 659 347

Exercises

/   Extending the algorithm

Extend the algorithm so that it will search for phrases with up to 30 words.
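One possible approach (a sketch under simplified assumptions, not the course's reference solution): instead of separate »previous« variables, a »collections.deque« with »maxlen« can hold the last n words, so the same loop works for any run length. Shown here for runs of three words on a made-up string, with the tokenization reduced to a single regex:

```python
import re
from collections import Counter, deque

def word_runs( text, n ):
    """Count runs of n consecutive words, using a sliding window."""
    counts = Counter()
    window = deque( maxlen=n )   # automatically discards the oldest word
    for match in re.finditer( r"(?:\w|')+", text ):
        window.append( match.group( 0 ).lower() )
        if len( window ) == n:
            counts[ ' '.join( window )] += 1
    return counts

print( word_runs( "to be or not to be or not to be", 3 ).most_common( 2 ))
```

The same function handles n = 30 without further changes, since the window size is just a parameter.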

Exercise

The first 74 lines of the output of such an algorithm for my Japanese Corpus
"はい" ×67739 ↔2 – HA I
"あっ" ×51279 ↔2 – A SMALL-TU
"うん" ×38420 ↔2 – U N
"えっ" ×37268 ↔2 – E SMALL-TU
"いや" ×32336 ↔2 – I YA
"ああ" ×32232 ↔2 – A A
"でも" ×29729 ↔2 – DE MO
"もう" ×23096 ↔2 – MO U
"あ" ×22929 ↔1 – A
"じゃあ" ×20123 ↔3 – ZI SMALL-YA A
"ちょっと" ×19575 ↔4 – TI SMALL-YO SMALL-TU TO
"あの" ×18596 ↔2 – A NO
"え" ×16755 ↔1 – E
"ん" ×16391 ↔1 – N
"これ" ×16078 ↔2 – KO RE
"何" ×15591 ↔1 – X-4F55
"私" ×14642 ↔1 – X-79C1
"だから" ×14125 ↔3 – DA KA RA
"おい" ×13855 ↔2 – O I
"まあ" ×13307 ↔2 – MA A
"今" ×12493 ↔1 – X-4ECA
"すいません" ×11940 ↔5 – SU I MA SE N
"私は" ×11247 ↔2 – X-79C1 HA
"ええ" ×11237 ↔2 – E E
"何か" ×10606 ↔2 – X-4F55 KA
"お前" ×10569 ↔2 – O X-524D
"それは" ×10475 ↔3 – SO RE HA
"ありがとう" ×10469 ↔5 – A RI GA TO U
"そう" ×9965 ↔2 – SO U
"ねえ" ×9752 ↔2 – NE E
"あれ" ×9180 ↔2 – A RE
"どうぞ" ×8971 ↔3 – DO U ZO
"ほら" ×8948 ↔2 – HO RA
"何で" ×8822 ↔2 – X-4F55 DE
"ありがとうございます" ×8571 ↔10 – A RI GA TO U GO ZA I MA SU
"あぁ" ×8396 ↔2 – A SMALL-A
"お願いします" ×7552 ↔6 – O X-9858 I SI MA SU
"俺" ×7515 ↔1 – X-4FFA
"また" ×7467 ↔2 – MA TA
"で" ×7283 ↔1 – DE
"よし" ×7214 ↔2 – YO SI
"これは" ×7143 ↔3 – KO RE HA
"いえ" ×7052 ↔2 – I E
"はぁ" ×7008 ↔2 – HA SMALL-A
"どうして" ×6943 ↔4 – DO U SI TE
"まだ" ×6851 ↔2 – MA DA
"大丈夫" ×6838 ↔3 – X-5927 X-4E08 X-592B
"それ" ×6710 ↔2 – SO RE
"俺は" ×6630 ↔2 – X-4FFA HA
"って" ×6593 ↔2 – SMALL-TU TE
"一同" ×6477 ↔2 – X-4E00 X-540C
"だって" ×6413 ↔3 – DA SMALL-TU TE
"ごめん" ×6407 ↔3 – GO ME N
"やっぱり" ×6314 ↔4 – YA SMALL-TU PA RI
"みんな" ×6033 ↔3 – MI N NA
"先生" ×5932 ↔2 – X-5148 X-751F
"えッ" ×5932 ↔2 – E SMALL-TU
"しかし" ×5825 ↔3 – SI KA SI
"い" ×5813 ↔1 – I
"それで" ×5798 ↔3 – SO RE DE
"そして" ×5790 ↔3 – SO SI TE
"う" ×5629 ↔1 – U
"はあ" ×5624 ↔2 – HA A
"どうも" ×5535 ↔3 – DO U MO
"は" ×5511 ↔1 – HA
"では" ×5275 ↔2 – DE HA
"あッ" ×5256 ↔2 – A SMALL-TU
"さあ" ×5123 ↔2 – SA A
"ごめんなさい" ×4997 ↔6 – GO ME N NA SA I
"今日は" ×4907 ↔3 – X-4ECA X-65E5 HA
"そんな" ×4819 ↔3 – SO N NA
"はっ" ×4805 ↔2 – HA SMALL-TU
"それが" ×4802 ↔3 – SO RE GA
"すみません" ×4755 ↔5 – SU MI MA SE N

The author of this text is Stefan Ram. All rights reserved. This page is a publication of Stefan Ram.
https://www.purl.org/stefan_ram/pub/applications_python