Real-World Data Sets
Here is a list of real-world data sets collected from the web.
FILE | SIZE | DESCRIPTION | FORMAT | SOURCE |
---|---|---|---|---|
dickens.txt | 28M | nearly the complete works of Charles Dickens | text file | Project Gutenberg |
magna-carta.txt | 78K | Magna Carta | text file | Project Gutenberg |
war+peace.txt | 3M | War and Peace | text file | Project Gutenberg |
chromosome4.txt | 10M | human chromosome 4 | text file | Project Gutenberg |
chromosome11.txt | 7M | human chromosome 11 | text file | Project Gutenberg |
ecoli.txt | 4M | E. coli genome | text file | Project Gutenberg |
world192.txt | 2M | World Factbook 1992 | text file | Project Gutenberg |
tale.txt | 779K | A Tale of Two Cities | text file | Project Gutenberg |
TomSawyer.txt | 406K | Tom Sawyer | text file | Project Gutenberg |
bible.txt | 4M | The Bible | text file | |
mobydick.txt | 1M | Moby Dick | text file | |
aesop.txt | 187K | Aesop's Fables | text file | |
manifesto.txt | 71K | Communist Manifesto | text file | |
lilwomen.txt | 1018K | Little Women | text file | |
muchado.txt | 121K | Much Ado About Nothing | text file | |
amendments.txt | 18K | amendments to constitution | text file | |
bush-kerry1.txt | 82K | Bush-Kerry debate 1 | text file | debates.org |
bush-kerry2.txt | 92K | Bush-Kerry debate 2 | text file | debates.org |
bush-kerry3.txt | 85K | Bush-Kerry debate 3 | text file | debates.org |
obama-mccain1.txt | 90K | Obama-McCain debate 1 | text file | debates.org |
obama-mccain2.txt | 88K | Obama-McCain debate 2 | text file | debates.org |
obama-mccain3.txt | 85K | Obama-McCain debate 3 | text file | debates.org |
pi-10million.txt | 10M | first 10 million digits of pi | text file | |
pi-1million.txt | 977K | first 1 million digits of pi | text file | |
elements.csv | 5K | periodic table of elements | CSV | |
surnames.csv | 9M | 151,671 surnames by race/ethnicity | CSV | 2000 US Census |
ip-by-country.csv | 5M | IP address ranges by country | CSV | MaxMind |
dma.csv | 4K | designated market area code | CSV | MaxMind |
misspellings.csv | 46K | common misspellings | CSV | Wikipedia |
starbucks.csv | 619K | latitude and longitude of Starbucks | CSV | POI Factory |
wendys.csv | 538K | latitude and longitude of Wendy's | CSV | POI Factory |
mcdonalds.csv | 1M | latitude and longitude of McDonald's | CSV | POI Factory |
burgerking.csv | 662K | latitude and longitude of Burger Kings | CSV | POI Factory |
walmart.csv | 468K | latitude and longitude of Walmarts | CSV | POI Factory |
homedepot.csv | 216K | latitude and longitude of Home Depots | CSV | POI Factory |
dairyqueen.csv | 459K | latitude and longitude of Dairy Queens | CSV | POI Factory |
pizzahut.csv | 551K | latitude and longitude of Pizza Huts | CSV | POI Factory |
zips1990.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger |
zips1990-full.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger |
zips1999.tsv | 2M | latitude and longitude by zip code in 1999 | TSV | Ben Fry |
zips2000.csv | 964K | latitude and longitude by zip code in 2000 | CSV | Tiger |
calories.csv | 43K | calories for various food items | CSV | |
DJIA.csv | 1M | Dow Jones Industrial Average | CSV | |
olympic-medals2012.csv | 49K | Olympic medals at London 2012 Olympics | CSV | |
McGuireAFB.csv | 278K | temperature at McGuire Air Force Base | two column | Robert Vanderbei |
morse.csv | 240 | Morse code | CSV | |
amino.csv | 1K | amino acids | CSV | |
names.csv | 103K | names and their meanings | CSV | |
codes.csv | 820 | states and FIPS codes | CSV | Tiger |
phone-na.csv | 28K | North American telephone codes | CSV | |
phone-international.csv | 3K | international telephone codes | CSV | |
airports.csv | 5K | airport codes | CSV | |
psychiatric.csv | 18K | psychiatric disorders and DSM codes | CSV | allpsych.com |
fortune500.csv | 1018K | Fortune 500 companies | CSV | |
fortune1000.csv | 23K | Fortune 1000 companies | CSV | |
language.csv | 196K | common words translated in 15 languages | CSV | |
synsets.txt | 8M | WordNet synonym sets | CSV | WordNet |
hypernyms.txt | 952K | WordNet hypernym relations | CSV | WordNet |
countries.csv | 7K | countries, capitals, and country codes | CSV | ubuntu.com |
iso3166.csv | 4K | ISO 3166 country codes | CSV | MaxMind |
fips10_4.csv | 73K | FIPS 10-4 subcountry codes | CSV | MaxMind |
bnc-wordfreq.csv | 122K | frequency of words in British National Corpus | CSV | Adam Kilgarriff |
upc-glns.csv | 5M | manufacturers by 13-digit GLN | CSV | upcdatabase.com |
upc-items.csv | 58M | items by UPC code | CSV | upcdatabase.com |
sdss174052.csv | 20M | 0.1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
sdss1738478.csv | 201M | 1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
sdss6949386.csv | 804M | 4% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
comets.csv | 3K | comets | CSV | Home Planet |
meteors.csv | 3K | meteor showers | CSV | Home Planet |
mktsymbols.txt | 4M | market symbols | tab-separated file | |
movies-hero.txt | 44K | movies with hero in the title | / delimited | IMDb |
movies-mpaa.txt | 6M | movies rated by the MPAA | / delimited | IMDb |
movies-top-grossing.txt | 152K | top-grossing movies | / delimited | IMDb |
contiguous-usa.dat | 642 | adjacencies between contiguous US and DC | vertex pairs | Stanford GraphBase |
usa13509.txt | 319K | latitude and longitude of 13,509 cities in US | latitude, longitude pairs | TSPLIB |
leipzig/leipzig100k.txt | 12M | 100K random sentences | one sentence per line | Leipzig Corpora |
leipzig/leipzig300k.txt | 37M | 300K random sentences | one sentence per line | Leipzig Corpora |
leipzig/leipzig1m.txt | 124M | 1 million random sentences | one sentence per line | Leipzig Corpora |
words.txt | 164K | 20,068 words | one word per line | |
wordlist.txt | 2M | 224,714 words | one word per line | |
words.utf-8.txt | 6M | 645,288 words | one word per line | |
words.shakespeare.txt | 228K | words in the complete works of Shakespeare | one word per line | |
ospd.txt | 600K | Official Scrabble Players Dictionary | one word per line | |
enable1.txt | 2M | ENABLE word list | one word per line | |
web2.txt | 2M | Webster's NI2 dictionary | one word per line | Webster |
1000words.txt | 7K | 1000 most common words | one word per line | |
words5-knuth.txt | 34K | 5757 five-letter words | one word per line | Stanford GraphBase |
stopwords.txt | 3K | words ignored in Wikipedia search | one word per line | MySQL |
commonwords.txt | 784K | 74,202 common terms | one term per line | |
california-gov.txt | 2K | list of candidates for California governor | one candidate per line | |
fips55-all.txt | 29M | codes for named populated places | fixed-width fields | FIPS 55-3 |
fips55-pa.txt | 2M | codes for named populated places | fixed-width fields | FIPS 55-3 |
bostonmetro.txt | 5K | Boston Metro | fixed-width fields | MIT |
latlng.txt | 1M | latitude and longitude of 25,000+ places in US | fixed-width fields | Tiger |
dblp.xml.gz | 125M | computer science bibliographies | XML | DBLP |
mthesaur.txt | 24M | Moby Thesaurus | CSV | Project Gutenberg |
papers.lst | 12M | computer science papers | text | Joel Seiferas |
Internet movie database.
State boundaries by county.
Presidential election results.
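
The FORMAT column in the table above names several layouts (CSV, "/ delimited", one word per line). The Python sketch below shows one way to read each of these. It is only an illustration: the helper names (`read_csv`, `read_delimited`, `read_word_list`) and the sample strings are made up for this example and are not taken from the actual files, whose exact field layout and encoding should be checked before use.

```python
import csv
import io


def read_csv(text):
    """Parse comma-separated text into a list of rows (lists of strings)."""
    # csv.reader handles quoted fields that themselves contain commas.
    return list(csv.reader(io.StringIO(text)))


def read_delimited(text, delimiter):
    """Split each non-empty line on the given delimiter string."""
    return [line.split(delimiter) for line in text.splitlines() if line]


def read_word_list(text):
    """Return one entry per non-blank line, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical samples standing in for the real files (assumed layouts).
csv_sample = 'name,latitude,longitude\n"Springfield, IL",39.8,-89.6\n'
slash_sample = "Some Movie (2001)/Doe, Jane/Roe, Richard\n"
words_sample = "apple\nbanana\n\ncherry\n"

rows = read_csv(csv_sample)
records = read_delimited(slash_sample, "/")
words = read_word_list(words_sample)

print(rows[1])
print(records[0])
print(words)
```

To apply the same helpers to a downloaded file, the text can be obtained with something like `open(path, encoding="utf-8").read()`; the UTF-8 encoding is an assumption and may not hold for every file in the list.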