Data is beautiful and I have recently been playing with making graphs using Google’s Ngram Viewer. It’s a pretty interesting proposition: Google generates graphs based on the frequency with which a word appears in Google Books, and the assertion is that you can use the Ngram Viewer to trace word usage. There has been some interesting criticism of this tool, but I think the best critique I’ve read was on StackExchange. There are known concerns about the accuracy of Google’s metadata and how easy it is to make inaccurate graphs (hyphenation! capitalization!). I’d like to add one more concern to the list: it is missing the most interesting data. Ngram Viewer assumes that Google Books’ catalog is a representative sampling of current publications and that books are representative of current language use.
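To make the hyphenation and capitalization worry concrete, here is a tiny sketch in Python. It is not how Google computes anything: the corpus is a made-up toy, and `TOY_CORPUS` and `relative_frequency` are names I invented for illustration. It only shows how an exact-match count can split one word across several spellings, and how the picture changes once you fold case and hyphens together.

```python
from collections import Counter

# Made-up toy corpus (year -> text). This is NOT real Google Books data.
TOY_CORPUS = {
    1990: "the e-mail arrived and the Email was read",
    2000: "email is everywhere and Email is email",
}


def relative_frequency(corpus, term, normalize=False):
    """Return {year: share of that year's tokens that match `term`}.

    With normalize=True, case and hyphens are ignored, so "e-mail",
    "Email", and "email" all count as the same word.
    """
    def fold(word):
        return word.lower().replace("-", "")

    result = {}
    for year, text in corpus.items():
        tokens = text.split()
        if normalize:
            tokens = [fold(t) for t in tokens]
            target = fold(term)
        else:
            target = term
        counts = Counter(tokens)
        # Share of all tokens in that year, so years of different size compare.
        result[year] = counts[target] / len(tokens)
    return result


exact = relative_frequency(TOY_CORPUS, "email")
folded = relative_frequency(TOY_CORPUS, "email", normalize=True)
print("exact: ", exact)
print("folded:", folded)
```

In the toy data, the exact-match line for "email" sits at zero in 1990 even though the word is plainly there under two other spellings. A graph built that way would show a word appearing out of nowhere, when really only the spelling changed.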

Google Books has been embroiled in legal battles over copyright and fair use. So, I think it is fair to assume that Google Books has significant omissions in its catalog. Those omissions are probably recent works, since living authors and known copyright holders are the ones suing Google. Google Books also disproportionately favors English-language publications and scholarly works, so it likely does not have a representative or current set of publications for generating these graphs.

As a data nerd, I always want to know if a trend is still persisting. I thought it was really weird that Ngram Viewer starts the user off with a demo graph that spans from 1800 to 2000; the demo graph is for the frequency of “Albert Einstein,” “Sherlock Holmes,” and “Frankenstein” in English-language publications : / This got me thinking about time range. I think the most exciting ideas, and therefore shifts in language, are now self-published. Self-publishing can happen so easily and in so many forms, especially online: forums, blogs, fan fiction, zines, self-published books. Heck, my mom spent 18 months writing and researching a book about the history of Quaker meetinghouses in Bucks County, Pennsylvania. She self-published it and got an ISBN … the nerd apple does not fall far from the tree. I doubt that her book will ever make it into Google Books, though many Quaker libraries in the Philadelphia area have it in their collections.

Google Ngram Viewer, while interesting, is missing some of the most exciting and current data. Still, you should take it for a ride: it’s a pretty fun toy. Check out these entertaining Ngram Viewer graphs.