Real-World Data Sets

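Here is a list of real-world data sets collected from the web. Each entry gives the file name, its size, a short description, the file format, and the source where one is known.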

dickens.txt 28M nearly the complete works of Charles Dickens text file Project Gutenberg
magna-carta.txt 78K Magna Carta text file Project Gutenberg
war+peace.txt 3M War and Peace text file Project Gutenberg
chromosome4.txt 10M human chromosome 4 text file Project Gutenberg
chromosome11.txt 7M human chromosome 11 text file Project Gutenberg
ecoli.txt 4M E. coli genome text file Project Gutenberg
world192.txt 2M World Factbook 1992 text file Project Gutenberg
tale.txt 779K Tale of Two Cities text file Project Gutenberg
TomSawyer.txt 406K Tom Sawyer text file Project Gutenberg
bible.txt 4M The Bible text file
mobydick.txt 1M Moby Dick text file
aesop.txt 187K Aesop's Fables text file
manifesto.txt 71K Communist Manifesto text file
lilwomen.txt 1018K Little Women text file
muchado.txt 121K Much Ado About Nothing text file
amendments.txt 18K amendments to constitution text file
bush-kerry1.txt 82K Bush-Kerry debate 1 text file
bush-kerry2.txt 92K Bush-Kerry debate 2 text file
bush-kerry3.txt 85K Bush-Kerry debate 3 text file
obama-mccain1.txt 90K Obama-McCain debate 1 text file
obama-mccain2.txt 88K Obama-McCain debate 2 text file
obama-mccain3.txt 85K Obama-McCain debate 3 text file
pi-10million.txt 10M first 10 million digits of pi text file
pi-1million.txt 977K first 1 million digits of pi text file
elements.csv 5K periodic table of elements CSV
surnames.csv 9M 151,671 surnames by race/ethnicity CSV 2000 US Census
ip-by-country.csv 5M IP address ranges by country CSV MaxMind
dma.csv 4K designated market area code CSV MaxMind
misspellings.csv 46K common misspellings CSV Wikipedia
starbucks.csv 619K latitude and longitude of Starbucks CSV POI Factory
wendys.csv 538K latitude and longitude of Wendy's CSV POI Factory
mcdonalds.csv 1M latitude and longitude of McDonald's CSV POI Factory
burgerking.csv 662K latitude and longitude of Burger Kings CSV POI Factory
walmart.csv 468K latitude and longitude of Walmarts CSV POI Factory
homedepot.csv 216K latitude and longitude of Home Depots CSV POI Factory
dairyqueen.csv 459K latitude and longitude of Dairy Queens CSV POI Factory
pizzahut.csv 551K latitude and longitude of Pizza Huts CSV POI Factory
zips1990.csv 2M latitude and longitude by zip code in 1990 CSV Tiger
zips1990-full.csv 2M latitude and longitude by zip code in 1990 CSV Tiger
zips1999.tsv 2M latitude and longitude by zip code in 1999 TSV Ben Fry
zips2000.csv 964K latitude and longitude by zip code in 2000 CSV Tiger
calories.csv 43K calories for various food items CSV
DJIA.csv 1M Dow Jones Industrial Average CSV
olympic-medals2012.csv 49K Olympic medals at London 2012 Olympics CSV
McGuireAFB.csv 278K temperature at McGuire Air Force Base two column Robert Vanderbei
morse.csv 240 Morse code CSV
amino.csv 1K amino acids CSV
names.csv 103K names and their meanings CSV
codes.csv 820 states and FIPS codes CSV Tiger
phone-na.csv 28K North American telephone codes CSV
phone-international.csv 3K international telephone codes CSV
airports.csv 5K airport codes CSV
psychiatric.csv 18K psychiatric disorders and DSM codes CSV
fortune500.csv 1018K Fortune 500 companies CSV
fortune1000.csv 23K Fortune 1000 companies CSV
language.csv 196K common words translated in 15 languages CSV
synsets.txt 8M WordNet synonym sets CSV WordNet
hypernyms.txt 952K WordNet hypernym relations CSV WordNet
countries.csv 7K countries, capitals, and country codes CSV
iso3166.csv 4K ISO 3166 country codes CSV MaxMind
fips10_4.csv 73K FIPS 10-4 subcountry codes CSV MaxMind
bnc-wordfreq.csv 122K frequency of words in British National Corpus CSV Adam Kilgarriff
upc-glns.csv 5M manufacturers by 13-digit GLN CSV
upc-items.csv 58M items by UPC code CSV
sdss174052.csv 20M .1% of Sloan Digital Sky galaxy objects CSV Sloan Digital Sky
sdss1738478.csv 201M 1% of Sloan Digital Sky galaxy objects CSV Sloan Digital Sky
sdss6949386.csv 804M 4% of Sloan Digital Sky galaxy objects CSV Sloan Digital Sky
comets.csv 3K comets CSV Home Planet
meteors.csv 3K meteor showers CSV Home Planet
mktsymbols.txt 4M market symbols tab-separated file
movies-hero.txt 44K movies with hero in the title / delimited IMDb
movies-mpaa.txt 6M movies rated by the MPAA / delimited IMDb
movies-top-grossing.txt 152K top-grossing movies / delimited IMDb
contiguous-usa.dat 642 adjacencies between contiguous US and DC vertex pairs Stanford GraphBase
usa13509.txt 319K latitude and longitude of 13,509 cities in US latitude, longitude pairs TSPLIB
leipzig/leipzig100k.txt 12M 100K random sentences one sentence per line Leipzig Corpora
leipzig/leipzig300k.txt 37M 300K random sentences one sentence per line Leipzig Corpora
leipzig/leipzig1m.txt 124M 1 million random sentences one sentence per line Leipzig Corpora
words.txt 164K 20,068 words one word per line
wordlist.txt 2M 224,714 words one word per line
words.utf-8.txt 6M 645,288 words one word per line
words.shakespeare.txt 228K words in the complete works of Shakespeare one word per line
ospd.txt 600K Official Scrabble Players Dictionary one word per line
enable1.txt 2M ENABLE word list one word per line
web2.txt 2M Webster's NI2 dictionary one word per line Webster
1000words.txt 7K 1000 most common words one word per line
words5-knuth.txt 34K 5757 five-letter words one word per line Stanford GraphBase
stopwords.txt 3K words ignored in Wikipedia search one word per line MySQL
commonwords.txt 784K 74,202 common terms one term per line
california-gov.txt 2K list of candidates for California governor one candidate per line
fips55-all.txt 29M codes for named populated places fixed-width fields FIPS 55-3
fips55-pa.txt 2M codes for named populated places fixed-width fields FIPS 55-3
bostonmetro.txt 5K Boston Metro fixed-width fields MIT
latlng.txt 1M latitude and longitude of 25,000+ places in US fixed-width fields Tiger
dblp.xml.gz 125M computer science bibliographies XML DBLP
mthesaur.txt 24M Moby Thesaurus CSV Project Gutenberg
papers.lst 12M computer science papers text Joel Seiferas
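
Most of the files above fall into a few line-oriented formats: one item per line, CSV, and "/ delimited". As a rough illustration, the Python sketch below reads each kind. The helper names `read_word_list` and `read_delimited` are hypothetical, and the inline sample rows are made up for the example; they are not taken from the actual files, whose column layouts, header rows, and encodings are not described here and should be checked before use.

```python
import csv
import io


def read_word_list(stream):
    """Return the non-empty lines of a one-item-per-line file, stripped."""
    return [line.strip() for line in stream if line.strip()]


def read_delimited(stream, delimiter=","):
    """Return the rows of a delimited file as lists of strings.

    Blank lines are skipped. Whether the first row is a header
    depends on the file and is left to the caller.
    """
    return [row for row in csv.reader(stream, delimiter=delimiter) if row]


if __name__ == "__main__":
    # Made-up samples standing in for the real files.
    words = io.StringIO("apple\nbanana\n\ncherry\n")
    table = io.StringIO("H,Hydrogen,1\nHe,Helium,2\n")
    slashed = io.StringIO("Some Title (2008)/Last, First\n")

    print(read_word_list(words))
    print(read_delimited(table))
    print(read_delimited(slashed, delimiter="/"))
```

To use a downloaded file, pass an open file object instead of the in-memory sample, for example `open("words.txt", encoding="utf-8")`. The UTF-8 encoding is an assumption and may not hold for every file in the list. For tab-separated files, `delimiter="\t"` should work the same way.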

Internet Movie Database.

State boundaries by county.

Presidential election results.