Real-World Data Sets
Here is a list of real-world data sets collected from the web.
| FILE | SIZE | DESCRIPTION | FORMAT | SOURCE |
|---|---|---|---|---|
| dickens.txt | 28M | nearly the complete works of Charles Dickens | text file | Project Gutenberg |
| magna-carta.txt | 78K | Magna Carta | text file | Project Gutenberg |
| war+peace.txt | 3M | War and Peace | text file | Project Gutenberg |
| chromosome4.txt | 10M | human chromosome 4 | text file | Project Gutenberg |
| chromosome11.txt | 7M | human chromosome 11 | text file | Project Gutenberg |
| ecoli.txt | 4M | E. coli genome | text file | Project Gutenberg |
| world192.txt | 2M | World Factbook 1992 | text file | Project Gutenberg |
| tale.txt | 779K | Tale of Two Cities | text file | Project Gutenberg |
| TomSawyer.txt | 406K | Tom Sawyer | text file | Project Gutenberg |
| bible.txt | 4M | The Bible | text file | |
| mobydick.txt | 1M | Moby Dick | text file | |
| aesop.txt | 187K | Aesop's Fables | text file | |
| manifesto.txt | 71K | Communist Manifesto | text file | |
| lilwomen.txt | 1018K | Little Women | text file | |
| muchado.txt | 121K | Much Ado About Nothing | text file | |
| amendments.txt | 18K | amendments to the constitution | text file | |
| bush-kerry1.txt | 82K | Bush-Kerry debate 1 | text file | debates.org |
| bush-kerry2.txt | 92K | Bush-Kerry debate 2 | text file | debates.org |
| bush-kerry3.txt | 85K | Bush-Kerry debate 3 | text file | debates.org |
| obama-mccain1.txt | 90K | Obama-McCain debate 1 | text file | debates.org |
| obama-mccain2.txt | 88K | Obama-McCain debate 2 | text file | debates.org |
| obama-mccain3.txt | 85K | Obama-McCain debate 3 | text file | debates.org |
| pi-10million.txt | 10M | first 10 million digits of pi | text file | |
| pi-1million.txt | 977K | first 1 million digits of pi | text file | |
| elements.csv | 5K | periodic table of elements | CSV | |
| surnames.csv | 9M | 151,671 surnames by race/ethnicity | CSV | 2000 US Census |
| ip-by-country.csv | 5M | IP address ranges by country | CSV | MaxMind |
| dma.csv | 4K | designated market area code | CSV | MaxMind |
| misspellings.csv | 46K | common misspellings | CSV | Wikipedia |
| starbucks.csv | 619K | latitude and longitude of Starbucks | CSV | POI Factory |
| wendys.csv | 538K | latitude and longitude of Wendys | CSV | POI Factory |
| mcdonalds.csv | 1M | latitude and longitude of McDonalds | CSV | POI Factory |
| burgerking.csv | 662K | latitude and longitude of Burger Kings | CSV | POI Factory |
| walmart.csv | 468K | latitude and longitude of Walmarts | CSV | POI Factory |
| homedepot.csv | 216K | latitude and longitude of Home Depots | CSV | POI Factory |
| dairyqueen.csv | 459K | latitude and longitude of Dairy Queens | CSV | POI Factory |
| pizzahut.csv | 551K | latitude and longitude of Pizza Huts | CSV | POI Factory |
| zips1990.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger |
| zips1990-full.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger |
| zips1999.tsv | 2M | latitude and longitude by zip code in 1999 | CSV | Ben Fry |
| zips2000.csv | 964K | latitude and longitude by zip code in 2000 | CSV | Tiger |
| calories.csv | 43K | calories for various food items | CSV | |
| DJIA.csv | 1M | Dow Jones Industrial Average | CSV | |
| olympic-medals2012.csv | 49K | Olympic medals at London 2012 Olympics | CSV | |
| McGuireAFB.csv | 278K | temperature at McGuire Air Force Base | two column | Robert Vanderbei |
| morse.csv | 240 | Morse code | CSV | |
| amino.csv | 1K | amino acids | CSV | |
| names.csv | 103K | names and their meanings | CSV | |
| codes.csv | 820 | states and FIPS codes | CSV | Tiger |
| phone-na.csv | 28K | North American telephone codes | CSV | |
| phone-international.csv | 3K | international telephone codes | CSV | |
| airports.csv | 5K | airport codes | CSV | |
| psychiatric.csv | 18K | psychiatric disorders and DSM codes | CSV | allpsych.com |
| fortune500.csv | 1018K | Fortune 500 companies | CSV | |
| fortune1000.csv | 23K | Fortune 1000 companies | CSV | |
| language.csv | 196K | common words translated in 15 languages | CSV | |
| synsets.txt | 8M | WordNet synonym sets | CSV | WordNet |
| hypernyms.txt | 952K | WordNet hypernym relations | CSV | WordNet |
| countries.csv | 7K | countries, capitals, and country codes | CSV | ubuntu.com |
| iso3166.csv | 4K | ISO 3166 country codes | CSV | MaxMind |
| fips10_4.csv | 73K | FIPS 10-4 subcountry codes | CSV | MaxMind |
| bnc-wordfreq.csv | 122K | frequency of words in British National Corpus | CSV | Adam Kilgarriff |
| upc-glns.csv | 5M | manufacturers by 13-digit GLN | CSV | upcdatabase.com |
| upc-items.csv | 58M | items by UPC code | CSV | upcdatabase.com |
| sdss174052.csv | 20M | 0.1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
| sdss1738478.csv | 201M | 1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
| sdss6949386.csv | 804M | 4% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky |
| comets.csv | 3K | comets | CSV | Home Planet |
| meteors.csv | 3K | meteor showers | CSV | Home Planet |
| mktsymbols.txt | 4M | market symbols | tab-separated file | |
| movies-hero.txt | 44K | movies with hero in the title | / delimited | IMDb |
| movies-mpaa.txt | 6M | movies rated by the MPAA | / delimited | IMDb |
| movies-top-grossing.txt | 152K | top-grossing movies | / delimited | IMDb |
| contiguous-usa.dat | 642 | adjacencies between contiguous US and DC | vertex pairs | Stanford GraphBase |
| usa13509.txt | 319K | latitude and longitude of 13,509 cities in US | latitude, longitude pairs | TSPLIB |
| leipzig/leipzig100k.txt | 12M | 100K random sentences | one sentence per line | Leipzig Corpora |
| leipzig/leipzig300k.txt | 37M | 300K random sentences | one sentence per line | Leipzig Corpora |
| leipzig/leipzig1m.txt | 124M | 1 million random sentences | one sentence per line | Leipzig Corpora |
| words.txt | 164K | 20,068 words | one word per line | |
| wordlist.txt | 2M | 224,714 words | one word per line | |
| words.utf-8.txt | 6M | 645,288 words | one word per line | |
| words.shakespeare.txt | 228K | words in the complete works of Shakespeare | one word per line | |
| ospd.txt | 600K | Official Scrabble Players Dictionary | one word per line | |
| enable1.txt | 2M | ENABLE word list | one word per line | |
| web2.txt | 2M | Webster's NI2 dictionary | one word per line | Webster |
| 1000words.txt | 7K | 1000 most common words | one word per line | |
| words5-knuth.txt | 34K | 5757 five-letter words | one word per line | Stanford GraphBase |
| stopwords.txt | 3K | words ignored in Wikipedia search | one word per line | MySQL |
| commonwords.txt | 784K | 74,202 common terms | one term per line | |
| california-gov.txt | 2K | list of candidates for California governor | one candidate per line | |
| fips55-all.txt | 29M | codes for named populated places | fixed-width fields | FIPS 55-3 |
| fips55-pa.txt | 2M | codes for named populated places | fixed-width fields | FIPS 55-3 |
| bostonmetro.txt | 5K | Boston Metro | fixed-width fields | MIT |
| latlng.txt | 1M | latitude and longitude of 25,000+ places in US | fixed-width fields | Tiger |
| dblp.xml.gz | 125M | computer science bibliographies | XML | DBLP |
| mthesaur.txt | 24M | Moby Thesaurus | CSV | Project Gutenberg |
| papers.lst | 12M | computer science papers | text | Joel Seiferas |
Internet movie database.
State boundaries by county.
Presidential election results.
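
The FORMAT column in the table suggests how each file might be split into fields. The following is a minimal Python sketch, not a definitive loader: the `DELIMITERS` mapping and the `read_records` helper are hypothetical names introduced here, the sample strings are made up, and the actual column layout of each file is an assumption that should be checked against the file itself.

```python
import csv
import io

# Hypothetical mapping from FORMAT values in the table to a field delimiter.
DELIMITERS = {"CSV": ",", "tab-separated file": "\t", "/ delimited": "/"}


def read_records(stream, fmt):
    """Yield one list of fields per non-empty line of an open text stream."""
    delimiter = DELIMITERS.get(fmt)
    if delimiter is None:
        # Formats such as "one word per line": each line is a single field.
        for line in stream:
            line = line.rstrip("\r\n")
            if line:
                yield [line]
    else:
        # csv.reader also handles quoted fields that contain the delimiter.
        for row in csv.reader(stream, delimiter=delimiter):
            if row:
                yield row


# Made-up sample data standing in for the real files.
sample_csv = io.StringIO("H,Hydrogen,1\nHe,Helium,2\n")
sample_slash = io.StringIO("Some Movie (1999)/Actor A/Actor B\n")

for record in read_records(sample_csv, "CSV"):
    print(record)
for record in read_records(sample_slash, "/ delimited"):
    print(record)
```

For a real file, pass an open file object instead of the `io.StringIO` samples, for example `open("elements.csv", newline="", encoding="utf-8")`. The encoding is also an assumption; some of the older text files may need a different one. Fixed-width files such as `fips55-all.txt` are not covered by this sketch, since their column positions are not given in the table.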