finding texts (finding texts), notes, page 724659
https://www.purl.org/stefan_ram/pub/finding_text_linguistics (permalink) is the canonical URI of this page.
Stefan Ram
Python Course

Finding texts

When a search yields many results, one can add »filetype:txt« to the search input to restrict the results to plain text files. (This approach could be called “reverse collocations”: one does not obtain collocations from a corpus, but a corpus from collocations.)

Search input for general English texts
"to be a" "one of the" "out of the" "it was a" "on the other" "the end of" "some of the" "in the world" "this is the" "in the same"
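A search input like the one above can also be assembled from a list of collocations by a small Python function. The following is only a minimal sketch: the names »build_query«, »filetype«, and »exclude« are made up for this illustration and do not belong to any search engine interface.

```python
def build_query(phrases, filetype=None, exclude=()):
    """Join collocations into one search input string.

    Every phrase is wrapped in double quotes so that it is searched
    as an exact phrase. "filetype" and "exclude" are hypothetical
    helper parameters of this sketch.
    """
    parts = []
    if filetype:
        parts.append(f"filetype:{filetype}")
    # Terms to be excluded get a leading minus sign.
    parts.extend(f"-{word}" for word in exclude)
    parts.extend(f'"{phrase}"' for phrase in phrases)
    return " ".join(parts)


# The collocations from the search input for general English texts.
GENERAL_ENGLISH = [
    "to be a", "one of the", "out of the", "it was a", "on the other",
    "the end of", "some of the", "in the world", "this is the",
    "in the same",
]

print(build_query(GENERAL_ENGLISH, filetype="txt"))
```

Calling »build_query(["to be a", "one of the"], filetype="txt", exclude=["gutenberg"])« yields »filetype:txt -gutenberg "to be a" "one of the"«.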
Long English texts
filetype:txt -gutenberg  "to be a"  "chair"  "one of the"  "street"  "out of the"  "beast"  "it was a"  "circle"  "on the other"  "pulled"  "the end of" "asks"  "some of the"  "context"  "in the world"  "clarification"  "this is the"  "regular" "in the same" "responsibility"
Long English texts (1)
filetype:txt -gutenberg "rude" "quoted" "plain" "hide" "dare" "lies" "grade" "pattern" "fame" "unknown" "rule" "detailed" "affects" "frequently"
Search input for [transcripts of or scripts for] spoken texts
"a lot of" "one of the" "you have to" "to be a" "i don't know" "on the other" "some of the" "you want to" "out of the"
Search input for [transcripts of or scripts for] colloquial/casual spoken texts
"like" sorta OR "sort of" kinda OR "kind of" "pretty much" "for sure" "definitely" OR defly OR deffenly "uh" "guy" OR "guys" "wanna" "go ahead"
Can you guess which type of text these phrases find?
"this is the" "of the same" "and in the" "is in the" "was to be" "a man in" "to me i" "as to the"
The following set of phrases finds the “opposite” type.
"part of the" "as well as" "one of the" "it is not" "some of the" "the fact that" "the end of" "of the most"
Search free texts on Gutenberg
"almost no restrictions whatsoever." "You may copy it, give it away or" site:www.gutenberg.org/files filetype:txt -inurl:readme -readme inurl:0.txt 0.txt "to be a" "one of the" "out of the" "it was a" "on the other"
file name ends in »-0«: UTF-8
file name ends in »-8«: ISO-8859-1
file name without such a digit: ASCII
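The relation between the file name suffix and the encoding can be written down as a small Python function. This is a sketch under the assumption that the suffix directly precedes the extension »txt«; the function name »encoding_for« and the file names used below are hypothetical.

```python
from pathlib import PurePosixPath


def encoding_for(filename):
    """Return the encoding suggested by the suffix of a text file name.

    "-0" suggests UTF-8, "-8" suggests ISO-8859-1, and a name
    without such a suffix suggests ASCII.
    """
    stem = PurePosixPath(filename).stem  # name without ".txt"
    if stem.endswith("-0"):
        return "utf-8"
    if stem.endswith("-8"):
        return "iso-8859-1"
    return "ascii"


# Hypothetical file names, chosen only to show the three cases.
for name in ("12345-0.txt", "12345-8.txt", "12345.txt"):
    print(name, encoding_for(name))
```

The returned name can be passed as the »encoding« argument of »open« when reading a downloaded file.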
&num=100&safe=off&hl=en-US&ie=UTF-8&tbs=li%3A1&filter=0&start=0
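The parameter string above can be taken apart and put together again with the standard module »urllib.parse«. The following sketch only decodes and re-encodes the string; that »start« is the offset of the first result is an assumption of this example, and the function names are made up for it.

```python
from urllib.parse import parse_qsl, urlencode

PARAMS = "&num=100&safe=off&hl=en-US&ie=UTF-8&tbs=li%3A1&filter=0&start=0"


def decode_params(fragment):
    """Turn a parameter string into a dictionary of decoded values."""
    return dict(parse_qsl(fragment.lstrip("&")))


def with_start(params, start):
    """Return the encoded parameters with another value for "start".

    Assumption: "start" is the offset of the first result shown.
    """
    return urlencode({**params, "start": str(start)})


params = decode_params(PARAMS)
print(params["tbs"])            # the escape %3A is decoded to a colon
print(with_start(params, 100))
```

Here »params["tbs"]« is »li:1«, and »urlencode« escapes the colon again as »%3A«.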
long texts

www.1024logs.com/file/IMDB_train_data.txt (19.0 MB)

www.sls.hawaii.edu/bley-vroman/750_00/brown.txt (6.0 MB)

mallet.cs.umass.edu/ap.txt (5.6 MB)

bir.brandeis.edu/bitstream/10192/26951/2/1954-1955.pdf.txt (1.0 MB)

See also
>724629 Sources for linguistic analyses

"ram@zedat.fu-berlin.de" (without the quotation marks) is the email address of Stefan Ram.  |  Copyright 1998-2020 Stefan Ram, Berlin. All rights reserved. This page is a publication by Stefan Ram.
