In the wake of several women coming forward against Harvey Weinstein, Alyssa Milano started an online movement behind the hashtag #metoo on October 15th, 2017. She posted, “If you’ve been sexually harassed or assaulted write 'me too' as a reply to this tweet” (@Alyssa_Milano). What followed was a flood of stories that built a community of support, natively and primarily through social media. The movement encouraged more women to come forward, not only validating the experience of victims but also exposing more perpetrators beyond Weinstein. But is that all that was said within #metoo? This project explores the text of tweets from the 6 months following the birth of this digitally native social movement. By using unsupervised k-means cluster analysis, we can uncover organic themes. The project aims to answer the question: “what are people really saying with #metoo?”

Data Collection

#metoo was shared millions of times from countries all over the world. It sparked new hashtags in other languages, like #balancetonporc and #yotambien. The data used in this project was scraped from the public Twitter search page. The only requirements were that the tweet used the hashtag #metoo and was posted between October 14th, 2017, and April 14th, 2018. The result was nearly 1.4 million tweets: 1,392,076, to be exact.

The bar chart below shows tweets per day over the 6 months on a log scale. There are clear peaks for the first wave of #metoo tweets (October 16th), the naming of the Silence Breakers as Time’s Person of the Year (December 5th), the Golden Globes and the announcement of the #timesup movement (January 7th), and the Oscars (March 4th).
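To make the chart's construction concrete, here is a minimal sketch of the daily count and the log transform, using only Python's standard library. The function names and the sample timestamps are made up for illustration; the original scraping and plotting code is not shown in this write-up.

```python
import math
from collections import Counter
from datetime import datetime

def tweets_per_day(timestamps):
    """Count tweets per calendar day from ISO 8601 timestamp strings."""
    days = (datetime.fromisoformat(ts).date() for ts in timestamps)
    return Counter(days)

def log_scaled(counts):
    """Pair each day with log10 of its tweet count, sorted by date."""
    return [(day, math.log10(n)) for day, n in sorted(counts.items())]

# Hypothetical sample: ten tweets on one day and a single tweet on the next.
sample = ["2017-10-16T09:%02d:00" % minute for minute in range(10)]
sample.append("2017-10-17T08:00:00")

counts = tweets_per_day(sample)
scaled = log_scaled(counts)

for day, value in scaled:
    print(day, round(value, 2))
```

A log scale keeps the quiet days visible next to the peaks: a day with ten times as many tweets is only one unit taller.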

Cluster Formations

K-means clustering is an unsupervised machine learning process that finds natural groups using only the input data, with no labels supplied in advance. In this case, the only information used to create these groups was the words in the tweets. For example, if the word “trump” or “vote” was used, the tweet was grouped with other tweets that use those words as well. This is critical for this project because the groups are not defined by a human ahead of time. Through k-means clustering, we can let the words “speak for themselves” and create groups just by the nature of the words within.

The roughly 1.4 million tweets in this dataset are analyzed using “bag-of-words”, a process which identifies the unique words and counts their occurrences across the entire corpus. Because of the size of the data, 26,193,288 unique words were found. To make this more manageable, the clustering process only used a small portion of them: a word had to be present in at least 0.5% of the tweets in the corpus and in no more than 99% of them. That narrowed the total words down to 348.
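The frequency filter can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the project's actual code: the tokenizer, the function names, and the four toy tweets are invented, and the upper bound is read as "no more than 99% of tweets". The toy run raises the lower bound to 50% because a 0.5% threshold is meaningless on four tweets.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet and split it into word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())

def build_vocabulary(tweets, min_share=0.005, max_share=0.99):
    """Keep words whose share of tweets lies between min_share and max_share."""
    doc_freq = Counter()
    for tweet in tweets:
        doc_freq.update(set(tokenize(tweet)))  # count each word once per tweet
    total = len(tweets)
    return sorted(
        word for word, df in doc_freq.items()
        if min_share <= df / total <= max_share
    )

def vectorize(tweet, vocabulary):
    """Turn a tweet into a bag-of-words count vector over the vocabulary."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]

# Toy corpus, invented for illustration.
tweets = [
    "#metoo vote for change",
    "#metoo vote trump out",
    "#metoo brave women",
    "#metoo brave stories vote",
]

# Thresholds loosened to suit a four-tweet corpus.
vocabulary = build_vocabulary(tweets, min_share=0.5, max_share=0.99)
vectors = [vectorize(tweet, vocabulary) for tweet in tweets]
print(vocabulary)
print(vectors)
```

Here "#metoo" is dropped because it appears in every tweet, and the words that appear only once are dropped as too rare. The same logic explains why the hashtag itself carries no information for clustering this dataset.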

The clustering process considers whether each word was included, as well as its relationships to other words. The result is 425 clusters, each containing a group of tweets.
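For readers who want to see the mechanics, below is a minimal k-means sketch in plain Python, run on six made-up word-count vectors. It is not the project's implementation. In particular, the deterministic "farthest-first" seeding is a simplification chosen so the example is reproducible; real k-means runs typically use random or k-means++ initialization, and the project's choice is not stated.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at the first point,
    then keep adding the point farthest from every chosen centroid."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(points, k, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical count vectors over the words ["brave", "trump", "vote"].
vectors = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
    [2, 0, 0],
]

labels, centroids = kmeans(vectors, k=2)
print(labels)
```

On this toy input the tweets that mention "vote" or "trump" end up together and the "brave" tweets form the second group, which is the behavior described above at a much smaller scale than 425 clusters.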

Overall Clusters

The final clusters vary in size. The largest cluster, with over 81,000 tweets, serves as a “catch-all” for those tweets without an obvious direction or intention. The remaining clusters range in size from 17,000 tweets down to 32. Hover over a cluster to find out more, including its size, name, and top 10 words in that cluster.


Although each individual cluster is interesting, 425 is quite a lot to parse through. By applying a qualitative lens and examining the top words in each tweet and the top tweets in each cluster, we can find some interesting themes brought to life from the data. The following are just a few of the themes that arose from the clusters. Each theme is composed of anywhere from 2 to 10 individual clusters. Each circle represents a single tweet, and the size of the circle is representative of the tweet’s engagement. The top 1000 tweets from each theme are represented here, but there are many more in each cluster.
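One simple way to produce a "top words" list like the ones below is to count word frequencies within each cluster. The sketch assumes that approach; the write-up does not say whether top words were ranked by raw frequency or by some other weighting, and the function name and sample tokens are invented.

```python
from collections import Counter, defaultdict

def top_words(token_lists, labels, n=10):
    """Return the n most frequent words for each cluster label."""
    counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        counts[label].update(tokens)
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in counts.items()
    }

# Hypothetical tokenized tweets with their cluster assignments.
token_lists = [
    ["vote", "trump"],
    ["vote", "president"],
    ["vote", "trump"],
    ["brave", "courage"],
    ["brave", "strong", "courage"],
    ["brave"],
]
labels = [0, 0, 0, 1, 1, 1]

themes = top_words(token_lists, labels, n=2)
print(themes)
```

Reading these lists side by side is what makes it possible to name a cluster and to group related clusters into a theme.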

The political clusters

Although #metoo was born from experiences in Hollywood, its effects reverberated far beyond it. In these six clusters, the tweets discuss #metoo in politics from both sides of the aisle.

Top words: vote, please, moore, democrat, president, american, donald, end, country, trump.

The workplace clusters

A core theme of the movement is the abuse of power. This has sparked conversation about sexual harassment in the workplace and inspired women to share their workplace experiences.

Top words: work, workplace, changing, thanks, discuss, issue, power, abuse, conversation.

The angry clusters

It’s no secret that online activity includes toxic language. These few clusters, riddled with swear words and negative in tone, illuminate the anger of both supporters and critics.

Top words: disgusting, fucked, twitter, abuse, guy, shit, behavior, violence.

The conversation clusters

The power of the movement comes from the stories driving conversation. These four clusters show how the movement is sparking discussion, inspiring stories, and encouraging change.

Top words: discuss, conversation, gender, survivors, equally, fight, supporting, story, help.

The uplifting clusters

Beyond just driving change, the movement has created a community of support. These clusters, as well as several more, commend the brave women and men coming forward.

Top words: truth, strong, inspired, brave, courage, together, let, stand, love, amazing, voices.