Google Unleashes Mammoth Database of 500 Billion Words

With little fanfare, Google has made a mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a landscape of possibilities for research and education in the humanities.

The digital storehouse, which comprises words and short phrases as well as a year-by-year count of how often they appear, marks the first time a data set of this magnitude, along with the tools to search it, has been at the disposal of Ph.D.s, middle school students and anyone else who likes to spend time in front of a small screen. It consists of the 500 billion words contained in books published between 1800 and 2000 in English, French, Spanish, German, Chinese, Russian and Hebrew.

The intended audience is scholarly, but a simple online tool also allows anyone with a computer to plug in a string of up to five words and see a graph that charts the phrase’s use over time — a diversion that can quickly become as addictive as the video game “Angry Birds.”
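
The files behind that tool are, in essence, enormous tables of phrase, year and count. A minimal sketch of the computation such a graph requires, assuming hypothetical tab-separated files with an (n-gram, year, match count, volume count) row layout plus a per-year totals file, might look like this:

```python
# Sketch: chart a phrase's relative frequency by year, in the spirit of the
# online viewer. File names and the column layout are assumptions.
from collections import defaultdict

def frequency_by_year(ngram_path, totals_path, phrase):
    counts = defaultdict(int)   # year -> occurrences of the phrase
    totals = {}                 # year -> total words printed that year

    with open(ngram_path, encoding="utf-8") as f:
        for line in f:
            ngram, year, match_count, _vols = line.rstrip("\n").split("\t")
            if ngram == phrase:
                counts[int(year)] += int(match_count)

    with open(totals_path, encoding="utf-8") as f:
        for line in f:
            year, total = line.rstrip("\n").split("\t")[:2]
            totals[int(year)] = int(total)

    # Normalize to relative frequency so years with more books don't dominate.
    return {year: counts[year] / totals[year]
            for year in sorted(counts) if totals.get(year)}

if __name__ == "__main__":
    for year, freq in frequency_by_year("eng-1grams.tsv", "totals.tsv",
                                        "women").items():
        print(year, f"{freq:.3e}")
```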

With a click you can see that “women,” in comparison with “men,” is rarely mentioned until the early 1970s, when feminism gained a foothold. The two lines, moving in opposite directions, finally cross paths in about 1986.

You can also learn that Mickey Mouse and Marilyn Monroe don’t get nearly as much attention in print as Jimmy Carter; compare the many more references in English than in Chinese to “Tiananmen Square” after 1989; or follow how “grilling” began a climb in the late 1990s until it outpaced “roasting,” “baking” and “frying” in 2004.

“The goal is to give an 8-year-old the ability to browse cultural trends throughout history, as recorded in books,” said Erez Lieberman Aiden, a junior fellow at the Society of Fellows at Harvard.

Lieberman Aiden and Jean-Baptiste Michel, a postdoctoral fellow at Harvard, assembled the data set with Google and spearheaded a research project to demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas.

Their study, to be published in the journal Science on Friday, offers a tantalizing taste of the rich buffet of research opportunities now open to literature, history and other liberal arts professors who may have previously avoided quantitative analysis. Science is taking the unusual step of making the paper available online to nonsubscribers.

“We wanted to show what becomes possible when you apply very high-turbo data analysis to questions in the humanities,” said Lieberman Aiden, whose expertise is in applied mathematics and genomics.

He called the method “culturomics.” The data set can be downloaded, and users can build their own search tools.
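
As an illustration of what such a homemade search tool could look like, the sketch below streams one downloaded shard (the real files are far too large to load whole) and tallies every n-gram matching a regular expression; the file name passed on the command line and the four-column layout are assumptions, as above:

```python
# Sketch of a do-it-yourself search tool over one downloaded shard: stream
# the file line by line and report every n-gram matching a pattern.
import re
import sys
from collections import defaultdict

def search(shard_path, pattern):
    rx = re.compile(pattern)
    hits = defaultdict(int)  # ngram -> total occurrences across all years
    with open(shard_path, encoding="utf-8") as f:
        for line in f:
            ngram, _year, match_count, _vols = line.rstrip("\n").split("\t")
            if rx.search(ngram):
                hits[ngram] += int(match_count)
    # Most frequent matches first.
    return sorted(hits.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    for ngram, total in search(sys.argv[1], sys.argv[2])[:20]:
        print(f"{total:>12}  {ngram}")
```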

With the most powerful version of the data set, the researchers measured the endurance of fame, finding that written references to celebrities faded twice as quickly in the mid-20th century as they did in the early 19th.

“In the future everyone will be famous for 7.5 minutes,” they write.

Looking at inventions, they discovered that technological advances took, on average, 66 years to be adopted by the larger culture in the early 1800s and only 27 years in the period between 1880 and 1920.

They tracked how eccentric English verbs that did not add “ed” at the end for past tense (e.g., “learnt”) evolved to conform to the common pattern (“learned”). They calculated that the English lexicon had grown by 70 percent, to more than 1 million words, in the past 50 years, and they demonstrated how dictionaries could be updated much more rapidly by pinpointing newly popular words and obsolete words.
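
A crude sketch of that dictionary-updating idea: compare a word’s average frequency in an older and a newer window and flag large risers and fallers. The input is a per-year frequency table like the one computed in the first sketch; the date windows and the tenfold threshold are arbitrary choices for illustration.

```python
# Classify a word as rising, fading, newly popular, obsolete, or stable by
# comparing its mean frequency across two assumed date windows.

def classify(freq_by_year, old=(1900, 1950), new=(1960, 2000), factor=10.0):
    def mean_over(span):
        vals = [f for y, f in freq_by_year.items() if span[0] <= y <= span[1]]
        return sum(vals) / len(vals) if vals else 0.0

    before, after = mean_over(old), mean_over(new)
    if before == 0 and after > 0:
        return "newly popular"   # candidate for inclusion
    if after == 0 and before > 0:
        return "obsolete"        # candidate for removal
    if before and after / before >= factor:
        return "rising"
    if after and before / after >= factor:
        return "fading"
    return "stable"
```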

Steven Pinker, a linguist at Harvard who collaborated on the Science paper’s section about language evolution, has been studying changes in grammar and past tense forms for 20 years.

“When I saw they had this database, I was quite energized,” he said. “There is so much ignorance. We’ve had to speculate what might have happened to the language.”

The information about verb changes “makes the results more convincing and more complete,” Pinker added. “What we report in this paper is just the beginning.”

Despite the frequent resistance to quantitative analysis in some corners of the humanities, Pinker said he was confident that the use of this and similar tools would “become universal.”

Reactions from humanities scholars who quickly reviewed the article were more muted.

“In general it’s a great thing to have,” Louis Menand, an English professor at Harvard, said, particularly for linguists. But he cautioned that in the realm of cultural history, “obviously some of the claims are a little exaggerated.”

He was also troubled that not one of the paper’s 13 named authors was a humanist.

“There’s not even a historian of the book connected to the project,” Menand noted.

Alan Brinkley, the former provost at Columbia University and a professor of American history, said it was too early to tell what the impact of word and phrase searches would be.

“I could imagine lots of interesting uses; I just don’t know enough about what they’re trying to do statistically,” he said.

Aware of concerns raised by humanists that the essence of their art is a search for meaning, both Michel and Lieberman Aiden emphasized that culturomics simply provided information. Interpretation remains essential.

“I don’t want humanists to accept any specific claims — we’re just throwing a lot of interesting pieces on the table,” Lieberman Aiden said. “The question is: Are you willing to examine this data?”

Michel and Lieberman Aiden began their research on irregular verbs in 2004. Google Books did not exist then, and they had to scrutinize stacks of Anglo-Saxon texts page by page.

The process took 18 months.

“We were exhausted,” Lieberman Aiden said. The project “was a total Hail Mary pass; we could have collected this data set and proved nothing.”

Then they read about Google’s plan to create a digital library and store of every book ever published and recognized that it could revolutionize their research. They approached Peter Norvig, the director of research at Google, about using the collection to do statistical analyses.

“He realized this was a great opportunity for science and for Google,” Michel said. “We spent the next four years dealing with the many, many complicated issues that arose,” including legal complications and computational constraints. (A proposed class-action settlement pertaining to copyright and compensation brought by writers and publishers as a result of Google’s digitization plans is pending in the courts.)

Google says the culturomics project raises no copyright issue because the books themselves, or even sections of them, cannot be read.

So far, Google has scanned more than 11 percent of the entire corpus of published books, about 2 trillion words. The data analyzed in the Science article contains about 4 percent of the corpus.

The giant compendium of words makes it possible to analyze cultural influences statistically in a way that was not possible before. Cultural references tend to appear in print much less frequently than everyday words, said Michel, whose expertise is in applied math and systems biology.

To be confident that you are getting an accurate picture, you need a huge sample. Checking if “Sasquatch” has infiltrated the culture requires a warehouse of at least 1 billion words a year, he said.
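
The arithmetic behind that figure is easy to make concrete. If a rare cultural term appears on the order of once per billion words (an assumed rate, purely for illustration), smaller corpora yield expected yearly counts so low that Poisson noise swamps any trend:

```python
# Back-of-the-envelope version of Michel's point: expected yearly hits for a
# rare term, and the Poisson relative error (1/sqrt(expected)), at various
# corpus sizes. The occurrence rate is an illustrative assumption.
RATE = 1e-9  # assumed occurrences per word for a rare cultural term

for words_per_year in (1e7, 1e8, 1e9, 1e10):
    expected = RATE * words_per_year
    rel_err = expected ** -0.5 if expected else float("inf")
    print(f"{words_per_year:>12.0e} words/yr -> "
          f"{expected:6.2f} expected hits, ±{100 * rel_err:.0f}% noise")
```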

As for culturomics? In 20 years, type the word into an updated version of this database and see what happens.

Technology Is Opening Up New Ways for Companies to Interact With Clients, Employees, Panelists Say

Social media tools like Facebook and Twitter could open up new ways for companies to interact with clients and gauge demand for the products they sell, panelists at the Louisiana Technology Council’s 6th annual CIO/CTO Forum said Friday in New Orleans.

Cliff Triplett, vice president and chief information officer of Baker Hughes Inc., said researchers at Hewlett-Packard have already found that by scanning the Internet for chatter about soon-to-be-released movies, they can accurately predict the amount of money a film will take in on its opening weekend.

That opens up the possibility, according to Triplett, that companies could find other ways of using social media to gauge interest and demand for products they are preparing to roll out.
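
A minimal sketch of that idea, with entirely invented numbers: fit a straight line from pre-release chatter rates to opening sales for past products, then read off a forecast for a new one. This illustrates the general approach, not the Hewlett-Packard researchers’ actual model.

```python
# Hypothetical history: (mentions per day before launch, opening sales in $M).
mentions = [120, 450, 900, 1500, 2400]
sales = [0.8, 2.1, 4.0, 6.8, 10.5]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(mentions, sales)
forecast = slope * 1800 + intercept  # a new product at 1,800 mentions/day
print(f"forecast for 1,800 mentions/day: ${forecast:.1f}M")
```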

“Social media is becoming the way we communicate (and) sell,” Triplett said.

“I think it’s a huge issue (and) that we have many, many more turns of the crank before we understand the full implications of social media,” said Tony Scott, corporate vice president and chief information officer at Microsoft. “We all have to learn how to leverage it.”

Even technology that falls outside the realm of social media will change the way companies interact with their customers and employees in the future.

Mechanisms such as instant messaging and voice and video communication will make it possible to communicate more quickly with customers and reduce the time it takes to make decisions, said Dennis Walsh, General Motors’ former chief technology officer and currently a consultant to the company.

Technologies will also increasingly make it possible for employees to work from anywhere.

“This notion of ‘going’ to work has been tossed on its head,” Scott said.

But the convenience of always being plugged in to what’s happening at the office, regardless of where you are, cuts both ways, Walsh said.

“It’s a double-edged sword,” Walsh said. “It’s completely blurring the line between business and personal (time).”

Going forward, corporate information technology departments must evolve away from simply fixing computers and develop a new focus on helping to grow the business, the panelists said.

“We’ve got to dig deep and get involved in the very business processes” companies are pursuing, Scott said.

About 200 people attended Friday’s forum, held at the New Orleans Marriott on Canal Street.
