Downloading Project Gutenberg with wget
July 18th, 2009
I’m doing a text analysis project to look at word frequency and co-occurrence. I’m thinking of using 7 million medical abstracts from PubMed, 100,000 random web pages, a million random blog entries, Wikipedia and Project Gutenberg. I looked for some mirrors of Project Gutenberg and was going to wget the .txt files, using a simple Linux statement like the following for each directory.
wget -nd -r -l1 --no-parent -A.txt 'http://www.bsu.edu/libraries/gutenberg/default.asp?path=/gutenberg/etext00'
-nd no directories; by default a recursive wget recreates the remote directory tree locally
-r download recursively
-l1 (lowercase L, one) recursion depth 1: download only what that particular folder links to, don’t go any deeper
--no-parent never climb up to the parent directory; I definitely don’t want the parent’s files
-A.txt accept only files with a .txt extension
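Running that once per directory is easy to script. Here is a rough sketch, not something I actually ran against the mirror: the helper names mirror_url and fetch_dir are made up for illustration, and only etext00 comes from the command above (etext01 and etext02 are assumed directory names). It prints the wget commands instead of running them, so nothing is downloaded until you remove the echo.

```shell
#!/bin/sh
# Base URL of the mirror listing, copied from the command above.
BASE='http://www.bsu.edu/libraries/gutenberg/default.asp?path=/gutenberg'

# mirror_url DIR: print the listing URL for one directory (hypothetical helper).
mirror_url() {
    printf '%s/%s\n' "$BASE" "$1"
}

# fetch_dir DIR: print the wget command for one directory (hypothetical helper).
# Remove the leading "echo" to actually download.
fetch_dir() {
    echo wget -nd -r -l1 --no-parent -A.txt "$(mirror_url "$1")"
}

# etext00 is from the post; etext01 and etext02 are assumed names.
for dir in etext00 etext01 etext02; do
    fetch_dir "$dir"
done
```

Quoting the URL inside fetch_dir matters once the echo is removed, because the ? would otherwise be treated as a shell wildcard.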
… but it turns out that Project Gutenberg makes it even easier. The details are on this page:
http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
To get all of the English-language .txt files I just need the following statement:
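wget -w 2 -m 'http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en'
(-w 2 waits two seconds between requests and -m turns on mirroring. The quotes keep the shell from interpreting the & and the brackets in the URL.)
This statement even creates a nice logical directory organization of the downloaded files on your system!!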
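The harvest URL contains ?, & and [], all of which mean something to the shell, so it is safest to build it as one quoted string. A small sketch of that, assuming a made-up helper called harvest_url; only the txt and en values come from the statement above, and I have not checked which other values the harvest page accepts. As written it only prints the command.

```shell
#!/bin/sh
# harvest_url FILETYPE LANG: print the robot harvest URL (hypothetical helper).
harvest_url() {
    printf 'http://www.gutenberg.org/robot/harvest?filetypes[]=%s&langs[]=%s\n' "$1" "$2"
}

# txt and en are the values used in the post.
url=$(harvest_url txt en)

# Printed rather than run; remove "echo" to start the mirror.
# Keep "$url" in double quotes so the & does not background the command.
echo wget -w 2 -m "$url"
```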
Project Gutenberg is sweet!!!!