About ravi

I am the Data Scientist at Ranker.com. I use my combination of database skills (13+ years of technology experience) and statistical training (PhD in Social Psychology from USC) to power recommendation and ranking algorithms at Ranker.

An Opinion Graph of the World’s Beers

One of the strengths of Ranker’s data is that we collect opinions from users on such a wide variety of subjects that we can put them into a graph format.  Graphs are useful because they let you go beyond the individual relationships between items and see overall patterns.  In anticipation of Cinco de Mayo, I produced the opinion graph of beers below, based on votes on lists such as our Best World Beers list.  Connections in this graph represent significant correlations between sentiment towards the connected beers, and these correlations vary in strength.  A layout algorithm (Force Atlas in Gephi) placed beers that were more related closer to each other and beers that had fewer or weaker connections further apart.  I also ran a classification algorithm that clustered beers according to preference and colored the graph according to these clusters.  Click on the graph below to expand it.

Ranker's Beer Opinion Graph
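
For readers curious how edges like these could be derived, here is a minimal sketch. It is not Ranker's actual pipeline: the vote data, the function names, and the cutoffs are all hypothetical, and a simple threshold on the absolute correlation stands in for a proper significance test. The layout and clustering steps (done in Gephi for the graph above) are not covered.

```python
from itertools import combinations
from math import sqrt

# Hypothetical votes: user -> {beer: +1 for an upvote, -1 for a downvote}.
votes = {
    "u1": {"Stone IPA": 1, "Chimay": 1, "Miller Lite": -1},
    "u2": {"Stone IPA": 1, "Chimay": 1, "Miller Lite": -1},
    "u3": {"Stone IPA": -1, "Chimay": -1, "Miller Lite": 1},
    "u4": {"Stone IPA": -1, "Chimay": 1, "Miller Lite": 1},
    "u5": {"Stone IPA": 1, "Chimay": 1, "Miller Lite": -1},
    "u6": {"Stone IPA": -1, "Chimay": -1, "Miller Lite": 1},
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def opinion_edges(votes, min_abs_r=0.5, min_shared=3):
    """Return {(beer_a, beer_b): r} for pairs whose |r| clears the cutoff.

    min_abs_r and min_shared are illustrative stand-ins for a significance test.
    """
    items = sorted({beer for user_votes in votes.values() for beer in user_votes})
    edges = {}
    for a, b in combinations(items, 2):
        # Only users who voted on both beers contribute to the correlation.
        shared = [(v[a], v[b]) for v in votes.values() if a in v and b in v]
        if len(shared) < min_shared:
            continue
        r = pearson([s[0] for s in shared], [s[1] for s in shared])
        if abs(r) >= min_abs_r:
            edges[(a, b)] = round(r, 2)
    return edges


edges = opinion_edges(votes)
```

Positive edges link beers that the same people tend to like, and negative edges link beers whose fans tend to disagree. A layout algorithm would then use the edge weights to decide which nodes sit close together.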

One of the fun things about graphs is that different people will see different patterns.  Among the things I learned from this exercise are:

  • The opposite of light beer, from a taste perspective, isn’t dark beer.  Rather, light beers like Miller Lite are most opposite to craft beers like Stone IPA and Chimay.
  • Coors Light is the light beer that is closest to the mainstream cluster.  Stella Artois, Corona, and Heineken are also reasonable bridge beers between the main cluster and the light beer world.
  • The classification algorithm revealed six main taste/opinion clusters, which I would label: Really Light Beers (e.g. Natural Light), Lighter Mainstream Beers (e.g. Blue Moon), Stout Beers (e.g. Guinness), Craft Beers (e.g. Stone IPA), Darker European Beers (e.g. Chimay), and Lighter European Beers (e.g. Leffe Blonde).  The interesting parts about the classifications are the cases on the edge, such as how Newcastle Brown Ale appeals to both Guinness and Heineken drinkers.
  • Seeing beers graphed according to opinions made me wonder if companies consciously position their beers accordingly.  Is Pyramid Hefeweizen successfully appealing to the Sam Adams drinker who wants a bit of European flavor?  Is Anchor Steam supposed to appeal to both the Guinness drinker and the craft beer drinker?  I’m not sure if I know enough about the marketing of beers to know the answer to this, but I’d be curious if beer companies place their beers in the same space that this opinion graph does.

These are just a few observations based on my own limited beer drinking experience.  I tend to be more of a whiskey drinker, and hope more of you will vote on our Best Tasting Whiskey list, so I can graph that next.  I’d love to hear comments about other observations that you might make from this graph.

- Ravi Iyer

Ranker Uses Big Data to Rank the World’s 25 Best Film Schools

NYU, USC, UCLA, Yale, Juilliard, Columbia, and Harvard top the rankings.

Does USC or NYU have a better film school?  “Big data” can provide an answer to this question by linking data about movies and the actors, directors, and producers who have worked on specific movies, to data about universities and the graduates of those universities.  As such, one can use semantic data from sources like Freebase, DBPedia, and IMDB to figure out which schools have produced the most working graduates.  However, what if you cared about the quality of the movies they worked on rather than just the quantity?  Educating a student who went on to work on The Godfather must certainly be worth more than producing a student who received a credit on Gigli.

Leveraging opinion data from Ranker’s Best Movies of All-Time list in addition to widely available semantic data, Ranker recently produced a ranked list of the world’s 25 best film schools, based on credits on movies within the top 500 movies of all-time.  USC produces the most film credits by graduates overall, but when film quality is taken into account, NYU (208 credits) actually produces more credits among the top 500 movies of all-time than USC (186 credits).  UCLA, Yale, Juilliard, Columbia, and Harvard take places 3 through 7 on Ranker’s list.  Several professional schools that focus on the arts also place in the top 25 (e.g. London’s Royal Academy of Dramatic Art), as do some well-located high schools (New York’s Fiorello H. LaGuardia High School & Beverly Hills High School).

The World’s Top 25 Film Schools

  1. New York University (208 credits)
  2. University of Southern California (186 credits)
  3. University of California – Los Angeles (165 credits)
  4. Yale University (110 credits)
  5. Juilliard School (106 credits)
  6. Columbia University (100 credits)
  7. Harvard University (90 credits)
  8. Royal Academy of Dramatic Art (86 credits)
  9. Fiorello H. LaGuardia High School of Music & Art (64 credits)
  10. American Academy of Dramatic Arts (51 credits)
  11. London Academy of Music and Dramatic Art (51 credits)
  12. Stanford University (50 credits)
  13. HB Studio (49 credits)
  14. Northwestern University (47 credits)
  15. The Actors Studio (44 credits)
  16. Brown University (43 credits)
  17. University of Texas – Austin (40 credits)
  18. Central School of Speech and Drama (39 credits)
  19. Cornell University (39 credits)
  20. Guildhall School of Music and Drama (38 credits)
  21. University of California – Berkeley (38 credits)
  22. California Institute of the Arts (38 credits)
  23. University of Michigan (37 credits)
  24. Beverly Hills High School (36 credits)
  25. Boston University (35 credits)
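
The counting behind a list like this can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not Ranker's actual query against its triple store: the people, schools, and second movie title are placeholders, and `rank_schools` is a hypothetical helper. It joins film credits to alma maters and counts only credits on movies that appear in a top-movies set.

```python
from collections import Counter

# Placeholder data standing in for semantic sources such as Freebase or IMDB.
top_movies = {"The Godfather", "Top Movie 2"}  # stand-in for the top 500 list
credits = [
    ("Person A", "The Godfather"),
    ("Person A", "Gigli"),          # not in the top list, so not counted
    ("Person B", "Top Movie 2"),
    ("Person C", "The Godfather"),
]
alma_maters = {
    "Person A": ["School X"],
    "Person B": ["School X", "School Y"],  # a person can have several schools
    "Person C": ["School Z"],
}


def rank_schools(credits, alma_maters, top_movies):
    """Count top-movie credits per school, highest count first."""
    counts = Counter()
    for person, movie in credits:
        if movie not in top_movies:
            continue
        for school in alma_maters.get(person, []):
            counts[school] += 1
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = rank_schools(credits, alma_maters, top_movies)
```

Dropping the `top_movies` filter gives the raw quantity ranking, which is the version in which USC comes out ahead of NYU.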

“Clearly, there is a huge effect of geography, as prominent New York and Los Angeles based high schools appear to produce more graduates who work on quality films compared to many colleges and universities,” says Ravi Iyer, Ranker’s Principal Data Scientist, a graduate of the University of Southern California.

Ranker is able to combine factual semantic data with an opinion layer because Ranker is powered by a Virtuoso triple store with over 700 million triples of information that are processed into an entertaining list format for users on Ranker’s consumer-facing website, Ranker.com.  Each month, over 7 million unique users interact with this data – ranking, listing and voting on various objects – effectively adding a layer of opinion data on top of the factual data from Ranker’s triple store.  The result is a continually growing opinion graph that connects factual and opinion data.  As of January 2013, Ranker’s opinion graph included over 30,000 nodes with over 5 million edges connecting these nodes.

- Ravi Iyer

Predicting Box Office Success a Year in Advance from Ranker Data

A number of data scientists have attempted to predict movie box office success from various datasets.  For example, researchers at HP Labs were able to use tweets around the release date, plus the number of theaters that a movie was released in, to predict 97.3% of the variance in first-weekend box office revenue.  The Hollywood Stock Exchange, which lets participants bet on box office revenues and infers a prediction from those bets, predicts 96.5% of the variance in opening weekend box office revenue.  Wikipedia activity predicts 77% of the variance in box office revenue, according to a collaboration of European researchers.  Ranker runs lists of anticipated movies each year, often for more than a year in advance, so the question I wanted to answer was how well Ranker data predicts box office success.

However, since the above researchers have already shown that online activity at the time of the opening weekend predicts box office success during that weekend, I wanted to build upon that work and see if Ranker data could predict box office receipts well in advance of opening weekend.  Below is a simple scatterplot of results, showing that Ranker data from the previous year predicts 82% of variance in movie box office revenue for movies released in the next year.

Predicting Box Office Success from Ranker Data

The above graph uses votes cast in 2011 on our Most Anticipated 2012 Films list to predict the revenues of those films.  While our data is not as predictive as Twitter data collected leading up to opening weekend, the remarkable thing about this result is that most votes (8,200 votes from 1,146 voters) were cast 7-13 months before the actual release date.  I look forward to doing the same analysis on our Most Anticipated 2013 Films list at the end of this year.
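
A figure like "82% of variance" is the R² of a regression of revenue on the vote measure. The sketch below shows that calculation for a one-predictor linear fit. The numbers are invented for illustration and are not Ranker's data, and the post does not say whether the real analysis transformed votes or revenue first, so treat the plain linear form as an assumption.

```python
def r_squared(xs, ys):
    """Share of variance in ys explained by a least-squares line on xs.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical inputs: net votes per film the year before release,
# and box office revenue in millions for the same films.
net_votes = [10, 20, 30, 40]
revenue = [15, 25, 45, 55]

explained = r_squared(net_votes, revenue)
```

An R² near 1 means the points in the scatterplot fall close to a straight line, while an R² near 0 means the votes say little about revenue.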

- Ravi Iyer

Crowdsourcing Objective Answers to Subjective Questions – Nerd Nite Los Angeles

A lot of the questions on Ranker are subjective, but that doesn’t mean that we cannot use data to bring some objectivity to this analysis.  In the same way that Yelp crowdsources answers to subjective questions about restaurants and TripAdvisor crowdsources answers to subjective questions about hotels, Ranker crowdsources answers to a broader assortment of relatively subjective questions such as the Tastiest Pizza Toppings, the Best Cruise Destination, and the Worst Way to Die.

A few weeks ago, I did an informal talk on the Wisdom of Crowds approach that Ranker takes to crowdsource such answers at a Los Angeles bar as part of “Nerd Nite”.  The gist of it is that one can crowdsource objective answers to subjective questions by asking diverse groups of people questions in diverse ways.  Greater diversity, when aggregated effectively, enables the error inherent in answering any subjective question to be minimized.  For example, we know intuitively that relying on only the young or only the elderly or only people in cities or only people who live in rural areas gives us biased answers to subjective questions.  But when all of these diverse groups agree on a subjective question, there is reason to believe that there is an objective truth that they are responding to.  Below is the video of that talk.

If you want to see a more formal version of this talk, I’ll be speaking at greater length on Ranker’s methodologies at the Big Data Innovation Summit in San Francisco this Friday.

- Ravi Iyer

The Opinion Graph predicts more than the Interest Graph

At Ranker, we keep track of talk about the “interest graph” because we have our own parallel graph of relationships between objects in our system, which we call an “opinion graph”.  I was recently sent this video concerning the power of the interest graph to drive personalization.

The points made in the video are very good, about how the interest graph is more predictive than the social graph, as far as personalization goes.  I love my friends, but the kinds of things they read and the kinds of things I read are very different and while there is often overlap, there is also a lot of diversity.  For example, trying to personalize my movie recommendations based on my wife’s tastes would not be a satisfying experience.  Collaborative filtering using people who have common interests with me is a step in the right direction and the interest graph is certainly an important part of that.

However, you can predict more about a person with an opinion graph versus an interest graph. The difference is that while many companies can infer from web behavior what people are interested in, perhaps by looking at the kinds of articles and websites they consume, a graph of opinions actually knows what people think about the things they are reading about.  Anyone who works with data knows that the more specific a data point is, the more you can predict, as the amount of “error” in your measurement is reduced.  Reduced measurement error is far more important for prediction than sample size, which is a point that gets lost in the drive toward bigger and bigger data sets.  Nate Silver often makes this point in talks and in his book.

For example, if you know someone reads articles about Slumdog Millionaire, then you can serve them content about Slumdog Millionaire.  That would be a typical use case for interest graph data.  Using collaborative filtering, you can find out what other Slumdog Millionaire fans like and serve them appropriate content.  With opinion graph data, of the type we collect at Ranker, you might be able to differentiate between a person who thinks that Slumdog Millionaire is simply a great movie versus someone who thinks the soundtrack was one of the best ever.  If you liked the movie, we would predict that you would also like Fight Club.  But if you liked the soundtrack, you might instead be interested in other music by A.R. Rahman.
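
The difference between the two graphs can be made concrete with a toy example. Everything here is hypothetical (the users, the `related` lookup, and both function names), and real systems would learn the related items through collaborative filtering instead of a hand-written table. The point is only that the interest view sees the item, while the opinion view also sees which aspect of the item the person endorsed.

```python
# Hypothetical opinion data: user -> {(item, aspect): vote}, where +1 is positive.
opinions = {
    "alice": {("Slumdog Millionaire", "movie"): 1},
    "bob": {("Slumdog Millionaire", "soundtrack"): 1},
}

# Hypothetical lookup of what fans of each (item, aspect) also like.
related = {
    ("Slumdog Millionaire", "movie"): ["Fight Club"],
    ("Slumdog Millionaire", "soundtrack"): ["Other music by A.R. Rahman"],
}


def interest_recommendations(user):
    """Interest graph view: only knows which items the user engaged with."""
    items = {item for (item, _aspect) in opinions[user]}
    recs = []
    for (item, _aspect), suggestions in related.items():
        if item in items:
            recs.extend(suggestions)
    return recs


def opinion_recommendations(user):
    """Opinion graph view: uses the specific aspect the user voted up."""
    recs = []
    for key, vote in opinions[user].items():
        if vote > 0:
            recs.extend(related.get(key, []))
    return recs
```

With interest data alone, both users receive the same mixed list. With opinion data, the movie fan and the soundtrack fan receive different recommendations.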

Simply put, the opinion graph can predict more about people than the interest graph can.

- Ravi Iyer

Mitt Romney Should Have Advertised on the X-Files

With the election recently behind us, many political analysts are conducting analyses of the campaigns, examining what worked and what didn’t.  One specific area where the Obama team is getting praise is in their unprecedented use of data to drive campaign decisions, and even more specifically, how they used data to micro-target fans who watched specific TV shows.  From this New York Times article concerning the Obama Team’s TV analytics:

“Culling never-before-used data about viewing habits, and combining it with more personal information about the voters the campaign was trying to reach and persuade than was ever before available, the system allowed Mr. Obama’s team to direct advertising with a previously unheard-of level of efficiency, strategists from both sides agree….

[They] created a new set of ratings based on the political leanings of categories of people the Obama campaign was interested in reaching, allowing the campaign to buy its advertising on political terms as opposed to traditional television industry terms…..

[They focused] on niche networks and programs that did not necessarily deliver large audiences but, as Mr. Grisolano put it, did provide the right ones.”

 

The Obama team focused more on undecided/apolitical voters in an effort to get them to the polls.  Given that some Mitt Romney supporters have blamed a lack of turnout of supporters for the results of the election, perhaps Romney would have been smart to have created a ranked list of TV shows, based on how much fans of the shows supported Romney, and then placed positive/motivating ads on those shows in an effort to increase turnout of his base.  Where would Romney get such data?  From Ranker!

Mitt Romney is on many votable Ranker lists (e.g. Most Influential People of 2012) and based on people who voted on those lists and also lists such as our Best Recent TV Shows list, we can examine which TV shows are positively or negatively associated with Mitt Romney.  Below are the top positive results from one of our internal tools.

As you can see, the X-Files appears to be the highest correlated show, by a fair margin.  I don’t watch the X-Files, so I wasn’t sure why this correlation exists, but I did a bit of research, and found this article exploring how the X-Files supported a number of conservative themes, such as the persistence of evil, objective truth, and distrust of government (also see here).  The article points out that in one episode, right-wing militiamen are depicted as being heroic, which would never happen in a more liberal-leaning plot.  Perhaps if you are a conservative politician seeking to motivate your base, you should consider running ads on reruns of the X-Files, or if you run a television station that shows X-Files reruns, consider contacting your local conservative politicians leveraging this data.

You may notice that this list contains more classic/rerun shows (e.g. Leave it to Beaver) than current shows.  This appears to be part of a general trend where conservatives on Ranker tend to positively vote for classic TV, a subject we’ll cover in a future blog post.  The possibility of advertising on reruns is part of what we would like to highlight in this post, as ads are likely relatively cheap and audiences can be more easily targeted, a tactic which the Obama campaign has been praised for.  At Ranker, we’re hopeful that more advertisers will seek value in the long-tail and mid-tail and will seek to mimic the tactics of the Obama campaign, as our data is uniquely suited for such psychographic targeting.

- Ravi Iyer

How Crowdsourcing can uncover Niche/Trending shows

At Ranker, people give us their opinions in various different ways. Some people vote.  Other people make long lists.  Still others make really short lists.  Some people tell us their absolute favorite things, while others list everything they’ve ever experienced.  One of the advantages of this diversity is that it allows us to examine patterns within these divergent types of opinions.  For example, some things are really popular, meaning that everyone lists them (e.g. Michael Jordan is on everyone’s best basketball players list).  Most popular things are also things that people generally list high on their lists and also get lots of positive votes (e.g. Michael Jordan).  However, there are some things that don’t get listed very often, but when they do get listed, people are passionate about them, meaning that they get listed high on people’s lists.  We highlight these items in our system using the niche symbol.

I’ve recently been examining our “niche” tag, which signifies when something is not particularly popular, but people are passionate about it.  There are many reasons why things can be niche.  Some things appeal specifically to younger (e.g. Rugrats) or older crowds (e.g. The Rockford Files).  Other things have natural audiences (e.g. baseball fans who appreciate defense and think Ozzie Smith is one of the greatest players of all time).  The most interesting case is when something that I can’t identify starts showing on the niche list (see the current list here).
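
One way to express "rarely listed, but ranked high when listed" in code is shown below. This is a sketch of the idea only, not Ranker's niche algorithm: the scoring formula, the `niche_items` name, and both thresholds are assumptions made for illustration. Each item gets a popularity (the share of lists that include it) and an average position score (1.0 for the top slot), and items are flagged when popularity is low and the position score is well above the average item's.

```python
from statistics import mean, pstdev


def niche_items(lists, max_popularity=0.5, min_z=2.0):
    """Flag items that appear on few lists but sit unusually high on them.

    lists: ranked lists (best first); each item appears at most once per list.
    max_popularity and min_z are hypothetical cutoffs.
    """
    scores = {}
    for ranked in lists:
        n = len(ranked)
        for i, item in enumerate(ranked):
            # Position score: 1.0 for first place, approaching 0 at the bottom.
            scores.setdefault(item, []).append((n - i) / n)
    if not scores:
        return []
    avg = {item: mean(s) for item, s in scores.items()}
    values = list(avg.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(
        item
        for item, s in scores.items()
        if len(s) / len(lists) <= max_popularity
        and (avg[item] - mu) / sigma >= min_z
    )


# Toy data: "N" is on one list out of four, but at the very top of it.
example_lists = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["N", "A", "B", "C"],
    ["A", "B", "C"],
]
# With only four items, a z-score of 2 is unreachable, so the toy call lowers it.
flagged = niche_items(example_lists, max_popularity=0.5, min_z=1.0)
```

In the toy data, "A" also ranks high, but it is on every list, so it counts as popular and is not flagged.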

This is especially helpful for someone like me, who doesn’t always know what is ‘hot’ and naturally looks to data to find new quality entertainment.  A while back, the show Community was consistently showing highest on our niche algorithm.  Few people listed it as one of the best recent TV shows, but those who listed it tended to think very highly of it.  I was intrigued enough to watch the pilot on Hulu and have since become hooked.  Community has since graduated from our niche algorithm as it became popular.  Sometimes passion amongst a small group is how a trend starts.

As Margaret Mead believed that only a small group of citizens could change the world, so Malcolm Gladwell has shown how a small group of trendsetters can signal changes in pop culture.  Not everything on our niche list will become the next big thing, but it’s certainly a good place to search for candidates.

Among the things that people seem to be passionate about now, but that aren’t so popular, are several good candidates for up-and-coming movies, bands, or TV shows.  Papillon is currently hot, scoring over 2 standard deviations higher in terms of list position on our best movie list, despite being less popular than most movies.  Another Earth and 13 Assassins seem like potentially interesting and under-the-radar films from 2011.  Real Time with Bill Maher’s niche status may be due to its appeal to a particular ideological group, but Warehouse 13 appealed to just my niche, as it had passionate fans on both the best recent TV shows list and the best Sci-Fi TV shows list (it has since graduated from the list due to increased popularity).  Warehouse 13’s highest correlated show is one of my favorites, Battlestar Galactica, so I’m definitely going to check it out.

I tend to be a late adopter of pop culture, but thanks to the niche tag, maybe I can be a little hipper going forward.  Take a look at our niche items as of October 20, 2012 and any comments on other things to consider checking out would be appreciated. Or perhaps take a look in a few months time and consider whether our niche tag successfully captured coming trends in a few cases.

- Ravi Iyer

Validating Ranker’s Aggregated Data vs. a Gallup Poll of Best Colleges

We were talking to someone in the market research field about the credibility of Ranker’s aggregated rankings, and they were intrigued and suggested that we validate our data by comparing the aggregated results of one of our lists to the results achieved by a traditional research company using traditional market research methodologies.  Companies like Gallup often do not survey the same types of questions that we ask at Ranker, in part due to the inherent difficulties of open-ended polling via random digit dialing.  You can’t realistically call someone up at dinner time and ask them to list their 50 favorite TV shows.  You could ask them to name one favorite, but doing that, you can end up with headlines like “Americans admire Glenn Beck more than they admire the Pope.”  However, one question that both Gallup and Ranker have asked concerns the nation’s top colleges/universities.  How do Ranker’s results compare to Gallup’s data?  Below are our results, side by side.

Ranker vs Gallup Best US Colleges

From a market researcher’s perspective, this is good news for Ranker data.  Our algorithms have successfully replicated the top 4 results from the Gallup poll exactly, at a fraction of the cost.  This likely occurs because Ranker data is largely collected from users who find our website via organic search, so while our data is not a representative probability sample (assuming such a thing still exists in a world where people screen their calls on cellphones), our users tend to be more representative than the motivated Yelp user or the intellectual Quora user.  If you compare how representative Ranker’s best movies list is with Rotten Tomatoes’ aggregated opinion list (Toy Story 2 and Man on Wire are #1 & #2!?!?), you get a sense of the importance of having relatively representative data.

In addition, the fact that our lists are derived from a combination of methodologies (listing, reranking, and voting) means that the error associated with each method somewhat cancels out.  Indeed, one might argue that Ranker’s top dream colleges list is better than Gallup’s for precisely this reason, as individuals are often tempted to list their alma mater or their local school as the best college, and the long tail of answers might actually contain more pertinent information.  Aggregating ranked lists from motivated users and combining that data with casual voters might actually be the best way to answer a question like this.

- Ravi Iyer

How Ranker leverages Google’s Knowledge Graph

Google recently held their I/O conference and one of the talks was given by Freebase’s Shawn Simister, who was once Freebase’s biggest fan, and has since gone on to work at Google, which acquired Freebase a few years ago.  What is Freebase?  It’s the structured semantic data that powers Google’s knowledge graph and Ranker, along with many other organizations featured in this talk (Ranker is mentioned around the 8:45 mark).  This talk gives organizations that may not be familiar with Freebase an overview of how they can leverage Freebase’s semantic data.

How does Ranker use the knowledge graph?  Freebase’s semantic data powers much of what we do at Ranker and the below graph illustrates how we relate to the semantic web.

How Ranker Relates to the Semantic Web

We leverage the data from the semantic web, often via Freebase, to create content in list format (e.g. The Best Beatles Songs), which our users then vote on and re-rank.  This creates an opinion data layer that is easily exportable to any other entity (e.g. The New York Times or Netflix) that is connected to the larger semantic web.  Our hope is that just as people in the presentation are beginning to create mashups of factual data, eventually people will also want to merge in opinion data, and we hope to have the best semantic opinion dataset out there when that happens.  The more people that connect their data to the semantic web, the more lists we can create, and the more potential consumers exist for our opinion data.  As such, we’d encourage you to check out Shawn’s presentation and hopefully you’ll find Freebase as useful as we do.

- Ravi Iyer

 

Siri (and other mobile interfaces) will eventually need semantic opinion data

Search engines, which process text and give you a menu of potential matches, make sense when you use an interface with a keyboard, a mouse, and a relatively large screen. Consider the below search for information about Columbia.  Whether I mean Columbia University, Columbia Sportswear, or Columbia Records, I can relatively easily navigate to the official website of the place that I need.

Mobile devices require specificity as the cost of an incorrect result is magnified by the limits of the user interface.  When using something like Siri, it is important to be able to give a precise answer to a question, rather than a menu of potential answers, as it is far harder to choose using these interfaces.  As technology gets better, we will start to expect intelligent devices to be able to make the same inferences that we are able to make about what we mean when given limited information.  For example, if I say “how do I get to Columbia?” to my phone while in New York, it should direct me to Columbia University, whereas in Chicago, it should direct me to Columbia College of Chicago.  Leveraging contextual information is part of what makes Siri special, as it allows you to, for example, use pronouns.  Some have said that Siri has resurrected the semantic web, as, in order to make the above choice of “Columbia” intelligently, it needs to know that Columbia University is located in New York while Columbia College is located in Chicago.
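
The "Columbia" example can be sketched as a lookup that uses location facts to return one answer. This is a hypothetical illustration, not how Siri or any real assistant works: the tiny entity list and the `resolve` function are made up, and only the two city facts come from the paragraph above.

```python
# Hypothetical mini knowledge base of (name, city) facts.
entities = [
    {"name": "Columbia University", "city": "New York"},
    {"name": "Columbia College", "city": "Chicago"},
]


def resolve(query, user_city, entities):
    """Return the single entity matching the query in the user's city, else None.

    None signals that the context was not enough and a menu of options is needed.
    """
    matches = [e for e in entities if query.lower() in e["name"].lower()]
    local = [e for e in matches if e["city"] == user_city]
    if len(local) == 1:
        return local[0]["name"]
    return None


in_new_york = resolve("Columbia", "New York", entities)
in_chicago = resolve("Columbia", "Chicago", entities)
elsewhere = resolve("Columbia", "Boston", entities)
```

The same query gives a different precise answer in each city, and the function returns `None` when the structured facts cannot narrow the matches down to one.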

I have made the case before that people are increasingly seeking opinion data, not just factual data, online.  It bears repeating that, as depicted in the below graph, searches for opinion words like “best” are increasing, relative to factual words like “car”, “computer”, and “software” which once were as prevalent as “best”, but now lag behind.

The implication of these two trends is clear.  As more knowledge discovery is done via mobile devices that need semantic data to deliver precise contextual answers, and more knowledge discovery is about opinions, mobile interfaces such as Siri, or Google’s answer to Siri, will increasingly require semantic opinion datasets to power them.  Using such a dataset, you could ask your mobile device to “find a foreign movie” while travelling and it could cross-reference your preferences with those of others to find the best foreign movie that happens to be playing in your geographic area and conforms to your taste.  You could ask your mobile device to play some jazz music, and it could consider what music you might like or not like, in addition to the genre classifications of available albums.  These are the kinds of intelligent operations that human beings do every day, leveraging our knowledge both of the world’s facts and the world’s opinions.  In order to do these tasks well, any intelligent agent attempting them will require the same set of structured knowledge, in the form of semantic opinion data.  Not coincidentally, Ranker’s unique competency is the development of a comprehensive semantic opinion dataset.

- Ravi Iyer
