Volume 11, Number 5
Achraf Lassoued, University of Paris II and IRIF-CNRS, France
We observe a stream of text messages, generated by Twitter or read from a text file, and present a tool that constructs a dynamic list of topics. Each tweet generates edges of a graph whose nodes are tags: the edges link the author of the tweet to the tags present in the tweet. We consider the large clusters of the graph and approximate the stream of edges with Reservoir sampling. We study the giant components of the Reservoir; each large component represents a topic. The nodes of high degree and their edges provide the first layer of a topic, and the iteration over the nodes provides a hierarchical decomposition. For a standard text, we use Weighted Reservoir sampling, where the weight is the similarity between words given by Word2vec. We consider dynamic overlapping windows and provide the topicalization on each window. We compare this approach with the Word2content and LDA techniques in the case of a standard text, viewed as a stream.
NLP, Streaming algorithms, Clustering, Dynamic graphs.
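To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the tool itself. It assumes the classic uniform Reservoir sampling scheme (often called Algorithm R) and finds connected components with a union-find structure. The function names `tweet_edges`, `reservoir_sample`, and `components`, the reservoir size `k`, and the representation of an author as a node such as `"@a"` are illustrative assumptions that do not come from the paper.

```python
import random
from collections import defaultdict


def tweet_edges(author, tags):
    """Hypothetical helper: one edge from the tweet's author to each tag."""
    return [(author, tag) for tag in tags]


def reservoir_sample(edge_stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream.

    Classic uniform Reservoir sampling (Algorithm R), assumed here for
    illustration; the weighted variant is not shown.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(edge)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = edge
    return reservoir


def components(edges):
    """Connected components of the sampled graph, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(groups.values(), key=len, reverse=True)


# Toy stream: three tweets, two of which share the tag "#y".
stream = (
    tweet_edges("@a", ["#x", "#y"])
    + tweet_edges("@b", ["#y"])
    + tweet_edges("@c", ["#z"])
)
sample = reservoir_sample(stream, k=10)
topics = components(sample)
```

In this toy run the reservoir is larger than the stream, so every edge is kept and the largest component groups `@a`, `@b`, `#x`, and `#y`, which would be read as one candidate topic. On a real stream the reservoir is much smaller than the stream, and only sufficiently dense clusters are expected to survive as large components of the sample.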