LINIS Found the Limitations of Text Clustering on the Internet

Sergey Koltsov, Deputy Director of the Laboratory for Internet Studies (LINIS), presented his project on the problems of topic modeling of online texts at the Web Science conference in Bloomington.

Web Science studies the vast information network of people, communities, organizations, applications, and policies that shape and are shaped by the Web. Computing, physical, and social sciences come together, complementing each other in understanding how the Web affects our interactions and behaviors.

Sergey Koltsov (LINIS) received notable feedback on his paper at the ACM Web Science Conference in Bloomington, USA. In it he discussed an unresolved methodological problem in clustering large text collections obtained online, in particular the instability of the topic modeling algorithm. Experiments at LINIS showed that different solutions produced by this algorithm do not differ just slightly: they differ so dramatically that no conclusions can be drawn about the topical composition of the collection. LINIS is currently working on methods to stabilize topic modeling results.

The conference was held at Indiana University at the end of June and was co-sponsored by Google, Microsoft, Facebook and other businesses. This highly selective event included 30 presentations, with an acceptance rate of less than one third. The presentation that received the best paper award analyzed 2.3 million tweets devoted to the Gezi Park protests in Turkey. Among other things, it found that over time the discussion became more democratic and the ability to influence other users became more equally distributed.