Tag: Project Gutenberg
  • Downloading Project Gutenberg with wget

    July 18th, 2009

    I’m doing a text analysis project to look at word frequency and co-occurrence. I’m thinking of using 7 million medical abstracts from PubMed, 100,000 random web pages, a million random blog entries, Wikipedia, and Project Gutenberg. I looked for some mirrors of Project Gutenberg and was going to wget the .txt files, using a simple Linux statement like the following for each directory:

    wget -nd -r -l1 --no-parent -A.txt "http://www.bsu.edu/libraries/gutenberg/default.asp?path=/gutenberg/etext00"

    -nd no directories; by default wget recreates the remote directory hierarchy locally

    -r recursively download

    -l1 (L one) level 1; download only what is linked from that particular folder and don’t recurse any deeper

    --no-parent I definitely don’t want the parent directory’s files

    -A.txt means take only files with .txt extension
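
    Running that statement "for each directory" could be sketched as a small loop. This is only an illustration: etext00 is the one directory named in this post, so etext01 and etext02 are assumed names, and the mirror_url helper and DRY_RUN switch are made up for the example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Base of the mirror listing URL, taken from the wget statement in the post.
BASE='http://www.bsu.edu/libraries/gutenberg/default.asp?path=/gutenberg'

# Hypothetical helper: build the listing URL for one directory.
mirror_url() {
    printf '%s/%s\n' "$BASE" "$1"
}

# Only etext00 comes from the post; etext01 and etext02 are assumptions.
for dir in etext00 etext01 etext02; do
    url=$(mirror_url "$dir")
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: show the command instead of hitting the network.
        printf 'wget -nd -r -l1 --no-parent -A.txt "%s"\n' "$url"
    else
        # The URL is quoted so the shell does not interpret the "?".
        wget -nd -r -l1 --no-parent -A.txt "$url"
    fi
done
```

    Setting DRY_RUN=0 in the environment would run the downloads for real. Note that with -nd every file lands in the current directory, whichever remote folder it came from.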

    … but it turns out that Project Gutenberg makes it even easier. The details are on this page:

    http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages

    To get all of the English-language .txt files, I just need the following statement:

    wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

    This statement even creates a nice logical directory organization of the downloaded files on your system!!
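
    One thing to watch when typing the harvest URL at a shell prompt: it contains ?, [] and &. An unquoted & would put wget in the background and drop the langs[]=en part. Here is a rough sketch that builds the URL from a file type and a language code, then passes it to wget in quotes. The harvest_url helper and the DRY_RUN switch are made up for illustration, and I have only used the txt and en values shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: build a harvest URL from a file type and a language code.
harvest_url() {
    printf 'http://www.gutenberg.org/robot/harvest?filetypes[]=%s&langs[]=%s\n' "$1" "$2"
}

url=$(harvest_url txt en)

if [ "${DRY_RUN:-1}" = 1 ]; then
    # Default: print the command rather than starting a long mirror job.
    printf 'wget -w 2 -m "%s"\n' "$url"
else
    # Quoting keeps "&" from backgrounding wget and "[]" from being globbed.
    # -w 2 waits two seconds between requests; -m turns on mirroring.
    wget -w 2 -m "$url"
fi
```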

    Project Gutenberg is sweet!!!!