How to scrape Twitter with Soax Residential Proxy Service

Kaiwalya Koparkar
·Aug 27, 2022·


Hey everyone, have you ever wanted to get information about your own or another person's latest tweet? Normally you would visit the person's profile and check it there, but what if we could automate this process with Python so that the information about the tweet lands directly in your console? Sounds exciting, right? So let's get into it.

Why use Soax as a proxy service

Hmm... this is a slightly unusual step to start with, but it is an important one, so let me explain. Although this is a small application, we are writing a script that crawls the Twitter website, and Twitter actively tries to stop bots and other scripts from running on its platform. Also, when you run such a script directly from your machine, the machine is more exposed, because every request reveals information about it, such as your own IP address. This is where Soax comes into the picture. You can read more about Soax and its services here. Now that we know why we need Soax, let's go ahead and set it up.

Setting up Soax

  1. Sign up on the website, complete your authentication, and purchase a residential proxy plan (choose one according to your preferences and requirements).
  2. Once you have signed up and have a package, it's time to set up your proxy server. There are two main methods for setting up a proxy server, and both are nicely explained here:
    • Setting up the proxy server by whitelisting your IP address
    • Setting up the proxy server using username and password authentication
  3. Once you have set up a proxy server, the first step is done: your traffic now goes through the proxy, which hides your own IP address from the sites you visit.
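The scraper script later in this article loads a Chrome extension named proxy.zip through add_extension, but the article does not show how that file is produced. One common approach with username and password authentication is to generate a tiny extension that sets the proxy and answers the authentication prompt, since Chrome's --proxy-server flag does not accept credentials. The following is a minimal sketch under that assumption. The function name build_proxy_extension and the host, port, and credential values are hypothetical placeholders, not Soax's actual settings, and the extension uses Manifest V2, which newer Chrome releases are phasing out.

```python
import json
import zipfile


def build_proxy_extension(host, port, username, password, out_path="proxy.zip"):
    """Write a small Chrome extension that routes traffic through an authenticated proxy."""
    manifest = {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Proxy auth helper",
        "permissions": ["proxy", "webRequest", "webRequestBlocking", "<all_urls>"],
        "background": {"scripts": ["background.js"]},
    }

    # json.dumps quotes and escapes each value so it is a valid JavaScript literal.
    background_js = """
var config = {
  mode: "fixed_servers",
  rules: {
    singleProxy: {scheme: "http", host: %s, port: %d},
    bypassList: ["localhost"]
  }
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
chrome.webRequest.onAuthRequired.addListener(
  function(details) {
    return {authCredentials: {username: %s, password: %s}};
  },
  {urls: ["<all_urls>"]},
  ["blocking"]
);
""" % (json.dumps(host), int(port), json.dumps(username), json.dumps(password))

    # Pack both files into the zip archive that Chrome loads as an extension.
    with zipfile.ZipFile(out_path, "w") as archive:
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
        archive.writestr("background.js", background_js)
    return out_path
```

Calling build_proxy_extension("proxy.example.com", 9000, "my-user", "my-password") would write proxy.zip to the current directory; replace the placeholder values with the host, port, and credentials shown in your own proxy dashboard.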

Now that our connection goes through the proxy, let's go ahead and build the script for scraping Twitter.

Creating Scraper Application

Great, now that you have set up the Soax proxy, we are ready to create our scraper application, which scrapes a particular Twitter page and logs the text of the most recent tweet along with its likes, retweets, and number of replies. We will use Selenium and chromedriver to connect through the proxy, crawl the Twitter website, and scrape the data. (Although you have already set up and connected to the proxy, there is another method where you connect to the Soax proxy server directly from Python code.) The code for scraping the data follows; it is also on GitHub.


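# Import dependencies
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep

# Path to the chromedriver binary
PATH = "./chromedriver"
chrome_options = Options()
# Load the extension that routes the browser through the proxy
chrome_options.add_extension("proxy.zip")

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH), options=chrome_options)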
driver.get("https://twitter.com/login")

subject = "Kaiwalya Koparkar"


# Log in: fill in the username, then the password
sleep(3)
username = driver.find_element(By.XPATH, "//input[@name='text']")
username.send_keys('')  # put your Twitter username between the quotes
next_button = driver.find_element(By.XPATH, "//span[contains(text(),'Next')]")
next_button.click()

sleep(3)
password = driver.find_element(By.XPATH, "//input[@name='password']")
password.send_keys('')  # put your Twitter password between the quotes
log_in = driver.find_element(By.XPATH, "//span[contains(text(),'Log in')]")
log_in.click()

# Search item and fetch it
sleep(3)
search_box = driver.find_element(
    By.XPATH, "//input[@data-testid='SearchBox_Search_Input']")
search_box.send_keys(subject)
search_box.send_keys(Keys.ENTER)

sleep(3)
people = driver.find_element(By.XPATH, "//span[contains(text(),'People')]")
people.click()

sleep(3)
profile = driver.find_element(
    By.XPATH, "//*[@id='react-root']/div/div/div[2]/main/div/div/div/div/div/div[2]/div/section/div/div/div[1]/div/div/div/div/div[2]/div/div[1]/div/div[1]/a/div/div[1]/span/span")
profile.click()

UserTags = []
TimeStamps = []
Tweets = []
Replys = []
reTweets = []
Likes = []

articles = driver.find_elements(By.XPATH, "//article[@data-testid='tweet']")

while True:
    for article in articles:
        # Search relative to the current article, not the whole page;
        # driver.find_element would return the first match on the page every time
        UserTag = article.find_element(By.XPATH, ".//div[@data-testid='User-Names']").text
        UserTags.append(UserTag)

        TimeStamp = article.find_element(By.XPATH, ".//time").get_attribute('datetime')
        TimeStamps.append(TimeStamp)

        Tweet = article.find_element(By.XPATH, ".//div[@data-testid='tweetText']").text
        Tweets.append(Tweet)

        Reply = article.find_element(By.XPATH, ".//div[@data-testid='reply']").text
        Replys.append(Reply)

        reTweet = article.find_element(By.XPATH, ".//div[@data-testid='retweet']").text
        reTweets.append(reTweet)

        Like = article.find_element(By.XPATH, ".//div[@data-testid='like']").text
        Likes.append(Like)
    # Scroll to the bottom to load more tweets, then collect the articles again
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight);')
    sleep(3)
    articles = driver.find_elements(By.XPATH, "//article[@data-testid='tweet']")
    # Stop once more than five distinct tweets have been collected
    Tweets2 = list(set(Tweets))
    if len(Tweets2) > 5:
        break

print("Tweet: ", Tweets[0], "\nReplies: ", Replys[0], "\nRetweets: ", reTweets[0], "\nLikes: ", Likes[0])
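Because the loop in the script reads every visible article again after each scroll, the same tweet can be appended to the lists more than once. If you want to keep all collected tweets rather than only the first one, a minimal sketch of removing the repeats could look like the following. It assumes each tweet is stored as a dictionary, and the helper name dedupe_rows and the "user" and "timestamp" keys are hypothetical choices for illustration, not part of the original script.

```python
def dedupe_rows(rows):
    """Keep the first occurrence of each (user, timestamp) pair, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # A user handle plus a timestamp is assumed to identify a tweet
        key = (row["user"], row["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Sample rows shaped like the data the scraper collects (values are made up)
sample = [
    {"user": "@alice", "timestamp": "2022-08-01T10:00:00.000Z", "text": "first"},
    {"user": "@alice", "timestamp": "2022-08-02T09:30:00.000Z", "text": "second"},
    {"user": "@alice", "timestamp": "2022-08-01T10:00:00.000Z", "text": "first"},
]
unique_rows = dedupe_rows(sample)
```

Keying on the user and timestamp instead of the tweet text avoids dropping two different tweets that happen to contain the same words.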

Scraping and testing

It's finally time to test the application and see how it scrapes the data from Twitter and how its traffic shows up on the proxy server. The following terminal screenshot shows the scraped information about the tweet: [Image: terminal output]
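The reply, retweet, and like values come back from the page as text, not numbers. If you want to sort or compare them, a small conversion helper is one option. The sketch below assumes the counts appear in forms such as an empty string (no interactions), "17", "1,234", "1.2K", or "3M"; that format is an assumption about how the page renders counts, and parse_count is a hypothetical helper name.

```python
def parse_count(text):
    """Convert a count such as '', '17', '1,234', '1.2K' or '3M' into an int."""
    text = text.strip().replace(",", "")
    if not text:
        # An empty label is treated as zero interactions
        return 0
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)
```

For example, parse_count(Likes[0]) would turn the first like count from the script into an integer, provided the text matches one of the assumed forms.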

Wondering how this traffic shows up on the Soax proxy server? The Soax dashboard helps you analyze the monthly traffic used by your application. You can also set up IP rotation, which switches your IP address after a specified interval. Take a look at the images below: [Images: monthly traffic spent, IP rotation]

Thank you so much for reading ❤️
