Verified Twitter Usernames
How do you get a list of all verified accounts on Twitter?
A year ago, I faced an interesting problem: I had to get the usernames of all the verified accounts on Twitter. How would you go about it? What follows is my journey to the final solution.
TL;DR: There is no single correct approach to solving this.
Brute Force 😂 🤷‍♂️
The worst possible approach would be to iterate over every possible username, from length 1 to the maximum length Twitter allows, hit the Twitter API for each one, and aggregate all that were marked as verified. But there is a catch: retrieving user details through Twitter's API has a rate limit of 900 requests per 15-minute window.
An unknown number of total users on Twitter, a space complexity of practically infinity, and a time complexity of O(n) for an unknown n are only a few of the issues this brute-force method has. There was no point in even writing a script for it.
Check out the Twitter API rate limits here: https://developer.twitter.com/en/docs/twitter-api/v1/rate-limits
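Just to show how hopeless the enumeration is, here is a minimal sketch of it. The character set and 15-character cap match Twitter's username rules (letters, digits, underscore); the verification check itself is left out, since that is exactly the API call the rate limit throttles.

```python
import itertools
import string

# Characters Twitter allows in a username: letters, digits, underscore.
ALLOWED = string.ascii_lowercase + string.digits + "_"
MAX_LEN = 15  # Twitter's maximum username length


def candidate_usernames(max_len=MAX_LEN):
    """Yield every possible username from length 1 up to max_len."""
    for length in range(1, max_len + 1):
        for chars in itertools.product(ALLOWED, repeat=length):
            yield "".join(chars)


def count_candidates(max_len):
    """Total number of candidates up to max_len -- grows as 37^n."""
    return sum(len(ALLOWED) ** n for n in range(1, max_len + 1))
```

Even at length 6 this is already 37^6 ≈ 2.6 billion candidates; at 900 requests per 15 minutes, checking them all would take thousands of years.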
At this point, I needed more data to measure the size of the problem, and someplace from which I could somehow get only verified usernames (fetching them was a different ballgame altogether).
@verified
This was a very important breakthrough. After some research, I found that there is a Twitter account, @verified, which follows all the verified Twitter accounts. Looking at the details of that account, I found that the total number of verified accounts on Twitter was 361.7K.
Now that I knew the size of my problem, it looked like I could easily get the list of users this account follows, and voila! I thought I would just scrape the website to get that list, so I started writing a script right away. I had no clue about the mammoth problems that were waiting for me on the next page.
Scraping the Twitter web 🕸 💻
This was my first attempt at web scraping, and I went straight to Beautiful Soup. Honestly, I did not think it through and was just rushing, which was followed by a series of unpleasant surprises.
Beautiful Soup
https://pypi.org/project/beautifulsoup4/
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
Initially, when I executed the script, I got some usernames, but I noticed I was only able to fetch the first page and needed to move to the next page for more names, i.e., pagination. Later I realized that Twitter's following list is not paginated but lazy-loaded. So, in order to get the next set of usernames, I needed to scroll the page, which sadly was not possible with Beautiful Soup! 😩
At this point, I had aggregated only 60–70 usernames.
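The Beautiful Soup step looked roughly like the sketch below. The markup here is hypothetical: Twitter's real following page is rendered by JavaScript and its class names are different (and change often), so this only illustrates the parse-and-extract idea.

```python
from bs4 import BeautifulSoup

# Hypothetical markup -- Twitter's actual page structure differs; this
# stands in for whatever HTML the first page of the following list returns.
html = """
<div class="user-cell"><span class="username">@BarackObama</span></div>
<div class="user-cell"><span class="username">@nasa</span></div>
"""


def extract_usernames(page_html):
    """Pull every username out of a chunk of profile-list HTML."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [span.get_text(strip=True) for span in soup.select("span.username")]
```

The catch described above is exactly this: the parser only ever sees the handful of rows present in the initial HTML, and there is no "next page" URL to request, because the rest of the list only loads on scroll.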
Some research later, I concluded that the next step was an introduction to Selenium. I thought it would solve the problem and went on to modify the script.
About Selenium
https://pypi.org/project/selenium/
Selenium is an open-source web-based automation tool.
PS: Selenium also allowed me to easily perform authentication on Twitter, which could otherwise have been a problem for running a script of this kind.
After running the script, I expected my problem to be solved, but my computer started facing multiple out-of-memory crashes, and no amount of optimization brought me any closer to the endgame. At this point, I had gathered around 2,000 verified usernames.
Some further research later, I stumbled upon Selenium's headless mode.
About Selenium headless mode
Running Selenium in headless mode does not open a GUI (Graphical User Interface), which saves a lot of local memory.
Multiple attempts later, the results got a little better, but still not good enough to solve the problem. At this point, I had gathered around 4,000–5,000 verified usernames.
Selenium Script
A version of that script till this point can be found in the following gist.
https://gist.github.com/DipanshKhandelwal/f7acfded1b547fbd76c2b7d7810e6dd9
The taste of failure had slowly started to make me want to give up. It seemed this problem could only be solved by paying for Twitter's API, but my engineering undergrad instinct had taught me: don't pay for anything unless you absolutely have to.
The final solution 🎊
Another engineering instinct had taught me that if you can't solve a problem alone, you can definitely try it with your friends. So I collected some Twitter API keys from my friends and waited for the long-running script to fulfill its destiny!
The following is a brief description of the final solution. Each unique key allowed me to fetch around 60 usernames per hit, after which consecutive hits with that key returned a rate-limit-exceeded exception for a window of 15 minutes. To work around this, I built an iterative system in which each key was retired for a 15-minute window after returning its batch of roughly 60 usernames, the next key then followed suit, and an extra 15-minute timeout was added after the last key was used.
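The rotation described above can be sketched as follows. `fetch_page` is a hypothetical stand-in for the real Twitter API call, and the 15-minute cooldown comes from the rate-limit behavior just described; the clock and sleep are injectable so the logic is easy to test.

```python
import time
from collections import deque

COOLDOWN = 15 * 60  # seconds a key stays rate-limited after a hit


def harvest(keys, fetch_page, sleep=time.sleep, now=time.monotonic):
    """Rotate through API keys, parking each one for 15 minutes after use.

    `fetch_page(key)` should return a batch of usernames, or an empty
    batch once the account list is exhausted.
    """
    usernames = []
    ready = deque((key, 0.0) for key in keys)  # (key, time it becomes usable)
    while ready:
        key, ready_at = ready.popleft()
        wait = ready_at - now()
        if wait > 0:
            sleep(wait)  # every key is cooling down: pause until this one wakes
        batch = fetch_page(key)
        if not batch:  # list exhausted: stop rotating
            break
        usernames.extend(batch)
        ready.append((key, now() + COOLDOWN))  # park this key for 15 minutes
    return usernames
```

With enough keys in the deque, there is always one ready to fire, so the script spends most of its time fetching instead of sleeping.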
Obviously, a script of this magnitude would cripple my local system, so I ran it on an EC2 instance; the execution completed in roughly a day. Finally, I had aggregated ~360.7K verified Twitter usernames.
Congratulations!! You made it till the end.
You can find the final script here:
https://gist.github.com/DipanshKhandelwal/92e3d51531e3e01a14fa51bd50eec6ff
Hit me up if you have any doubts; I'll be glad to help. Please let me know your thoughts, and any other solutions you can think of 😃 !!
Thank you for reading 😃
Hey! I am Dipansh Khandelwal, a Computer Science Engineer and a Full Stack Developer. I have been programming for around 4 years now, working remotely most of the time, with a wide tech stack including native Android and iOS, React Native, React, Firebase, Django, Express and more. I like to listen to Audiobooks and play Badminton.
You can connect with me on LinkedIn (@dipanshkhandelwal) and check out some more of my work on GitHub (@DipanshKhandelwal).