Finding texts
Searching for several frequent phrases at once finds texts that contain all of them. (This could be called “reverse collocations”: one does not get collocations from a corpus, but a corpus from collocations.) When a search yields many results, one can add »filetype:txt« to restrict them to plain text files.
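A minimal sketch of how such a search input can be assembled from a phrase list. The function name `build_query` and its parameters are hypothetical and chosen only for illustration; the operators »filetype:« and »-term« are the ones used in the search inputs on this page.

```python
def build_query(phrases, filetype=None, exclude=()):
    """Join phrases into one search input, each phrase in double quotes.

    filetype: optional file type restriction, e.g. "txt"
    exclude:  terms to exclude, each emitted with a leading minus sign
    """
    parts = []
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.extend(f"-{term}" for term in exclude)
    parts.extend(f'"{phrase}"' for phrase in phrases)
    return " ".join(parts)


phrases = ["to be a", "one of the", "out of the"]

print(build_query(phrases))
# "to be a" "one of the" "out of the"

print(build_query(phrases, filetype="txt", exclude=["gutenberg"]))
# filetype:txt -gutenberg "to be a" "one of the" "out of the"
```

Keeping the phrases in a list makes it easy to swap in a different set (for example the spoken-text phrases) without retyping the operators.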
- Search input for general English texts
"to be a" "one of the" "out of the" "it was a" "on the other" "the end of" "some of the" "in the world" "this is the" "in the same"
- Long English texts
filetype:txt -gutenberg "to be a" "chair" "one of the" "street" "out of the" "beast" "it was a" "circle" "on the other" "pulled" "the end of" "asks" "some of the" "context" "in the world" "clarification" "this is the" "regular" "in the same" "responsibility"
- Long English texts (1)
filetype:txt -gutenberg "rude" "quoted" "plain" "hide" "dare" "lies" "grade" "pattern" "fame" "unknown" "rule" "detailed" "affects" "frequently"
- Search input for [transcripts of or scripts for] spoken texts
"a lot of" "one of the" "you have to" "to be a" "i don't know" "on the other" "some of the" "you want to" "out of the"
- Search input for [transcripts of or scripts for] colloquial/casual spoken texts
"like" sorta OR "sort of" kinda OR "kind of" "pretty much" "for sure" "definitely" OR defly OR deffenly "uh" "guy" OR "guys" "wanna" "go ahead"
- Can you guess which type of text these phrases find?
"this is the" "of the same" "and in the" "is in the" "was to be" "a man in" "to me i" "as to the"
- The following set of phrases finds the “opposite” type.
"part of the" "as well as" "one of the" "it is not" "some of the" "the fact that" "the end of" "of the most"
- Search free texts on Gutenberg
"almost no restrictions whatsoever." "You may copy it, give it away or" site:www.gutenberg.org/files filetype:txt -inurl:readme -readme inurl:0.txt 0.txt "to be a" "one of the" "out of the" "it was a" "on the other"
- file name ends in »-0«: UTF-8
- file name ends in »-8«: ISO-8859-1
- file name has no such digit suffix: ASCII
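The suffix list above can be written as a small lookup, for example to open a downloaded file with the right encoding. This is only a sketch: the function name `guess_encoding` and the example file names are hypothetical, and the mapping is taken from the list above, not checked against every file on the site.

```python
import re

# Suffix digit -> encoding, as given in the list above.
SUFFIX_ENCODINGS = {"0": "utf-8", "8": "iso-8859-1"}


def guess_encoding(filename):
    """Guess the encoding of a text file from its »-0« / »-8« suffix."""
    match = re.search(r"-(\d)\.txt$", filename)
    if match is None:
        # No »-digit« suffix before ».txt«: treated as ASCII.
        return "ascii"
    digit = match.group(1)
    if digit not in SUFFIX_ENCODINGS:
        # The list above says nothing about other digits.
        raise ValueError(f"unknown suffix -{digit} in {filename!r}")
    return SUFFIX_ENCODINGS[digit]


print(guess_encoding("1342-0.txt"))  # utf-8
print(guess_encoding("1342-8.txt"))  # iso-8859-1
print(guess_encoding("1342.txt"))    # ascii
```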
- Parameters appended to the search URL: &num=100&safe=off&hl=en-US&ie=UTF-8&tbs=li%3A1&filter=0&start=0
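The parameter string above can be produced with Python's standard `urllib.parse.urlencode` instead of being typed by hand. The page does not name the search engine, so the base URL below is an assumption, as is the parameter name »q« for the search input; the function name `build_search_url` is hypothetical. The comments on what each parameter does reflect my reading and are not stated here in the source.

```python
from urllib.parse import urlencode

# Assumption: the parameters look like those of Google web search.
BASE_URL = "https://www.google.com/search"


def build_search_url(query, start=0):
    """Build a search URL carrying the parameters listed above."""
    params = {
        "q": query,       # the search input (parameter name assumed)
        "num": 100,       # presumably results per page
        "safe": "off",    # presumably safe-search filtering off
        "hl": "en-US",    # presumably interface language
        "ie": "UTF-8",    # presumably input encoding
        "tbs": "li:1",    # encoded by urlencode as li%3A1
        "filter": 0,      # presumably no filtering of similar results
        "start": start,   # presumably offset of the first result
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url('filetype:txt "to be a" "one of the"'))
```

Raising `start` in steps of 100 would then presumably page through further results, if the service allows it.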
- Long texts
www.1024logs.com/file/IMDB_train_data.txt (19.0 MB)
www.sls.hawaii.edu/bley-vroman/750_00/brown.txt (6.0 MB)
mallet.cs.umass.edu/ap.txt (5.6 MB)
bir.brandeis.edu/bitstream/10192/26951/2/1954-1955.pdf.txt (1.0 MB)
- See also
- >724629 Sources for linguistic analyses