I was doing some Web popularity research and found a very cool data set collected by Philipp Lenssen back in 2003 and 2006. It is basically the Google page count for 27000 English vocabulary words.
I decided to repeat the process on a wider word set using at least two search engines (Google and Live Search). So I combined Philipp's 27000+ word vocabulary with the English index of Wiktionary (a wiki-based open content dictionary) and got a fairly comprehensive 74000+ word vocabulary that reflects contemporary English usage on the net. Then I collected the page count for each word as reported by Google and Live Search.
And here are some visualizations. Unfortunately, while Swivel can do great interactive visualizations, including clouds, it only supports static graphs for embedding. So don't hesitate to click on the graphs to see a better visualization (e.g. a cloud for the top 100 words).
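For anyone who wants to try the same kind of merge, here is a minimal sketch of combining two word lists into one deduplicated vocabulary. The function name, the sample words, and the lowercasing step are assumptions for illustration only; this is not necessarily how the 74000+ list was actually built.

```python
def merge_vocabularies(*word_lists):
    """Union several word lists, dropping blanks and duplicates."""
    merged = set()
    for words in word_lists:
        for word in words:
            # Trimming and lowercasing are assumptions made for this sketch.
            word = word.strip().lower()
            if word:
                merged.add(word)
    return sorted(merged)


# Tiny made-up samples standing in for the two real sources.
lenssen_words = ["apple", "Twitter", "pod"]
wiktionary_words = ["pod", "paladin", "twitter ", ""]

vocabulary = merge_vocabularies(lenssen_words, wiktionary_words)
print(len(vocabulary), vocabulary)
```

Using a set keeps the merge simple: the order of the sources does not matter and overlapping words are counted once.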
Top 30 most popular words by Google, Live (numbers are in billions):
As expected, the top is occupied by common English words and common internet-related nouns.
Top 30 most popular words by Google vs Live:
Top 30 gainers (Google, 2006 to 2008). Good to see a 48-fold page count gain for "twitter"; the rest I cannot explain. Can you?
oracular | x 163.6 |
planchette | x 153.7 |
newsy | x 93.5 |
posse | x 81.7 |
nymphet | x 75.2 |
jewelelry | x 65.6 |
twitter | x 48.6 |
paling | x 48.2 |
waylain | x 45.2 |
outmatch | x 45.2 |
outrode | x 41.6 |
pod | x 41.0 |
phizog | x 35.6 |
sinology | x 29.9 |
overdrew | x 26.7 |
multistorey | x 26.5 |
nonstick | x 25.6 |
nun | x 25.4 |
pedicure | x 24.8 |
pillory | x 24.8 |
panty | x 24.3 |
outridden | x 24.0 |
nip | x 23.2 |
naturism | x 23.2 |
organddy | x 23.0 |
piccolo | x 22.0 |
paladin | x 21.6 |
notability | x 21.2 |
breadthways | x 20.9 |
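The gain factors above are simply the newer page count divided by the older one. A rough sketch of that calculation follows; the function name and the sample counts are made up for illustration and are not taken from the data sets.

```python
def top_gainers(old_counts, new_counts, n=30):
    """Return the n words with the largest new/old page-count ratio."""
    gains = []
    for word, old in old_counts.items():
        new = new_counts.get(word)
        # Skip words missing from the newer set or with a zero baseline,
        # since the ratio is undefined for them.
        if new is None or old <= 0:
            continue
        gains.append((word, new / old))
    gains.sort(key=lambda pair: pair[1], reverse=True)
    return gains[:n]


# Made-up counts, for illustration only.
counts_2006 = {"alpha": 1000, "beta": 200, "gamma": 0}
counts_2008 = {"alpha": 3000, "beta": 5000, "gamma": 10}

for word, factor in top_gainers(counts_2006, counts_2008):
    print(f"{word} | x {factor:.1f} |")
```

Words with a very small 2006 count can produce huge ratios from tiny absolute changes, which may be part of why some entries in the list look odd.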
And finally, the top 10 longest words along with their page counts (Google, 2008):
<w c="1460">tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu</w>
<w c="5620">taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</w>
<w c="60">methionylglutaminylarginyltyrosylglutamylserylleucylphenylalanylal...serine</w>
<w c="62300">llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch</w>
<w c="20100">taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu</w>
<w c="285">aequeosalinocalcalinosetaceoaluminosocupreovitriolic</w>
<w c="69000">pneumonoultramicroscopicsilicovolcanoconiosis</w>
<w c="1010">hepaticocholangiocholecystenterostomies</w>
<w c="18">hepaticocholangiocholecystenterostomy</w>
<w c="74500">hippopotomonstrosesquippedaliophobia</w>
Unsurprisingly, the longest word is still the 92-letter name of a hill in New Zealand; this one is hard to beat.
The raw data sets (page count for 74000+ words) are available in XML format and also on Swivel (Google version, Live version), where you can play with them, visualizing and comparing in your own way. Can you come up with any more interesting visualizations or comparisons for this data set? Enjoy.
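If you want to reproduce the longest-words list yourself, here is a small sketch that reads `<w c="...">` entries like the ones above and sorts them by word length. The `<words>` root element is only a guess at how the full file is wrapped, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Three entries copied from the listing; the <words> root is an assumption.
SAMPLE = """<words>
<w c="18">hepaticocholangiocholecystenterostomy</w>
<w c="62300">llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch</w>
<w c="69000">pneumonoultramicroscopicsilicovolcanoconiosis</w>
</words>"""


def longest_words(xml_text, n=10):
    """Return up to n (word, page_count) pairs, longest word first."""
    root = ET.fromstring(xml_text)
    entries = [(w.text or "", int(w.get("c"))) for w in root.iter("w")]
    entries.sort(key=lambda entry: len(entry[0]), reverse=True)
    return entries[:n]


for word, count in longest_words(SAMPLE):
    print(len(word), count, word)
```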