Data Collection Bot with Python and Selenium

Alibaba Cloud
Jul 3, 2019


By Mark Andrews, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud’s incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

Say we want to gather data in real time and plug that data into any Alibaba Cloud service for analytics. With an Alibaba Cloud Elastic Compute Service (ECS) instance, we can achieve this.

Today we’ll be making our very own data collection “Bot” with Python and Selenium. Selenium WebDriver is a powerful extension to our Python toolkit: with it we can mimic many user behaviors, such as clicking on a specific element or hovering the mouse over a particular screen position. We can also automate repetitive day-to-day data-scraping tasks, such as retrieving real-time cryptocurrency market data, as we do in this tutorial. This data can then be passed into any other Alibaba Cloud service.

First, let’s instantiate and log in to a GUI-based Windows Alibaba Cloud Elastic Compute Service instance. I have chosen Windows Server 2016 as it runs fast enough for our purposes. The main point here is that we can watch the Python Selenium script work its magic in a graphical environment. At some point we may wish to run headless, but that is beyond the scope of this article.

Overview

In this article, we will:

1) Instantiate an Alibaba Cloud Windows Instance

2) Install

  • Firefox
  • Python 3
  • Selenium
  • Gecko Webdriver
  • Notepad ++

3) Write our Python Selenium script to check crypto markets.

  • Write a class for our “Bot”
  • A get_crypto_data function within our class
  • A write_file function for the class to write the real-time data to a local file on our cloud instance

Log on to your Alibaba Cloud Windows Instance to begin.

Installation

Open up Internet Explorer and navigate to mozilla.org. You may need to add some exceptions to the built-in Windows firewall to enable the Firefox installer to download.

Download Firefox from www.mozilla.org

Download the Python 3.7.1 installer from http://www.python.org/download/ and select the 64-bit executable installer.

Run the installer. Be sure to check the option to add Python to your PATH while installing.

The Windows version of Python 3.7.1 includes the pip installer. This will make it easy for us to install all the Python modules we need, specifically the Selenium WebDriver bindings.

Navigate to the Windows PowerShell interface and let’s make sure that Python was successfully installed.

python --version

We should see Python 3.7.1 here
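Since the Windows installer bundles pip, you can optionally confirm that it is available too (the exact version reported will vary):

pip --version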

So let’s install Selenium with pip.

pip install selenium
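To confirm the module installed correctly, you can print its version from Python (the version number you see may differ):

python -c "import selenium; print(selenium.__version__)"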

Now we need to install the proper web driver for Selenium. Since we are using Mozilla Firefox as our browser, we need geckodriver. The process is similar to installing the Chrome web driver.

https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-win64.zip

Download the archive from that link to your Alibaba Cloud Windows instance and extract the zip; I extracted it to the desktop. Then add the folder containing geckodriver.exe to the system PATH from PowerShell (you will need to open a new PowerShell window afterwards for the change to take effect).

setx path "$env:Path;C:\Users\Administrator\Desktop"
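Open a new PowerShell window so the updated PATH is picked up, then confirm Selenium will be able to find the driver (the version printed depends on the release you downloaded):

geckodriver --version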

Now let’s get our development environment organized. You can always just use Notepad to write your code, but I prefer Notepad++, a free, open-source editor, for its syntax highlighting.

https://notepad-plus-plus.org/repository/7.x/7.6/npp.7.6.Installer.x64.exe

Python Script

Now let’s decide what we want our Python Selenium “Web Bot” to do. It would be neat if our “bot” grabbed some cryptocurrency market data for us and wrote that data to a local file. We could then plug that data into any other Alibaba Cloud service. Get ready to type!

Now let’s get to coding. Fire up your editor and let’s get Selenium imported. We also import time and datetime, which will be explained later.

from selenium import webdriver
from datetime import datetime
import time

First, let’s get Selenium to actually do something so we know what to expect.

browser = webdriver.Firefox()
browser.get("http://www.baidu.com/")

Save your file with a .py extension.

Double-click your file.

There we go: in three lines of code we have Selenium WebDriver programmatically opening a web page.
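When you are done experimenting, it is good practice to close the browser from the script as well. A minimal sketch (printing the page title is just a quick way to confirm the page actually loaded):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.baidu.com/")
print(browser.title)   # confirm the page loaded by printing its title
browser.quit()         # close the browser and end the WebDriver session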

Let’s create our “Bot” class.

It’s always a good idea to take a moment and think about exactly what we want our program to accomplish.

I like to do this in steps. We can then break these steps down into functions for our Bot.

1) We want to programmatically go to a web page.
2) We want to scrape some data.
3) We want to pass that data around as variables.
4) We want to store that data in a file on the cloud instance.

Let’s start to define a basic Selenium “Bot” class.

from selenium import webdriver

class Bot():
    def __init__(self, url):
        self.browser = webdriver.Firefox()
        self.url = url

    def get_web_page(self):
        self.browser.get(self.url)

bot = Bot("http://www.baidu.com/")
bot.get_web_page()

The above is a more streamlined, class-based structure for organizing our program.

We set up our class, define the browser attribute as the Firefox WebDriver, and define the URL we want to pass into the class. Then we write a basic get_web_page method that calls the browser with the URL supplied at initialization.

Let’s run our script. It should initialize and run our Bot class: Selenium WebDriver should open a Firefox browser window and programmatically go to www.baidu.com.

Before we define our get_crypto_data function, we are going to www.coinwatch.com to get the XPath of the elements we want data from. We will then scrape the market data for Bitcoin.

The web developer tools in Firefox are adequate for this task, so we will use them to find the XPath of the elements we need.

Go to www.coinwatch.com and hover the mouse over the search bar in the upper right.

Right click on the search box element and select Inspect Element from the menu.

You will see the inspector window open at the bottom of the page with the element highlighted. Right-click on the highlighted markup in the inspector, navigate to the Copy sub-menu, and select Copy XPath. The XPath of that element is now on the clipboard, ready to paste into our Python Selenium script. Let’s get the XPath of the submit button while we’re at it as well.
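Before wiring the copied XPaths into the Bot, it can help to sanity-check one in a quick throwaway script. This is only an illustration: the placeholder locator below stands in for whatever XPath you actually copied from the inspector.

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.coinwatch.com/")
# Replace the placeholder locator with the XPath copied from the inspector.
search_box = browser.find_element_by_xpath("//input")
print(search_box.tag_name)  # should print "input" if the locator matched
browser.quit()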

Let’s get working!

from selenium import webdriver

class Bot():
    def __init__(self, crypto):
        self.browser = webdriver.Firefox()
        ### CRYPTO TO SEARCH FOR
        self.crypto = crypto

    def get_crypto_data(self):
        self.browser.get("https://www.coinwatch.com/")
        ### FIND SEARCH FORM AND INPUT KEYWORD
        print("FINDING SEARCH FORM")
        element = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/input")
        element.clear()
        element.send_keys(self.crypto)
        element = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/div/button")
        element.click()

bot = Bot("BTC")
bot.get_crypto_data()

In the above we are initializing the web driver. We define a variable crypto that will be our cryptocurrency search term, and we pass that crypto variable through the class via the self parameter to the get_crypto_data function. In get_crypto_data we call the web browser to get “https://www.coinwatch.com/”. Once the page loads, we locate the first element via the XPath we copied in: the search box. We clear the search element and then send our crypto variable as the search keys, then find and click the submit button. Finally, we instantiate our Bot as bot and pass in “BTC” as the crypto we want to search for.

We should be taken to the Bitcoin main page. Let’s take the opportunity now to copy the XPaths of all the elements we want to collect data from. For now we need price, cap, 24-hour volume, and 24-hour change. Let’s make variables to hold those values in our script, then pass those class-wide variables to our get_crypto_data function. After each element is located, we grab its text contents with the Selenium .text attribute and print them out in the terminal. Certain web pages take some time to load the elements of the DOM, so we may need to wait if we are getting errors such as elements not being found. The simplest (though definitely not the best) way of doing this is with time.sleep(), so we have imported time and added some sleep intervals to let the elements load.
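As an aside, Selenium also offers explicit waits, which poll for a condition instead of sleeping for a fixed time. A minimal sketch, assuming the same search-box XPath as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("https://www.coinwatch.com/")
# Wait up to 10 seconds for the search box to appear instead of sleeping blindly.
search_box = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.XPATH, "//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/input"))
)
search_box.send_keys("BTC")
browser.quit()

We will stick with time.sleep() below to keep the script simple.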

Now let’s put the full script together, run it, and see what happens.

from selenium import webdriver
from datetime import datetime
import time

class Bot():
    def __init__(self, crypto):
        self.browser = webdriver.Firefox()
        ### CRYPTO TO SEARCH FOR
        self.crypto = crypto
        self.price = None
        self.cap = None
        self.volume_24h = None
        self.change_24h = None

    def get_crypto_data(self):
        self.browser.get("https://www.coinwatch.com/")
        ### FIND SEARCH FORM AND INPUT KEYWORD
        print("FINDING SEARCH FORM")
        elem = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/input")
        elem.clear()
        elem.click()
        elem.send_keys(self.crypto)
        elem = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/div/button")
        time.sleep(1)
        elem.click()
        ### GET MARKET DATA
        print("GETTING MARKET DATA")
        time.sleep(5)
        self.price = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[1]/div/div[3]/div/div/span").text
        self.cap = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[1]/span[2]").text
        self.volume_24h = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[2]/span[2]").text
        self.change_24h = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[8]/span[2]").text
        print("PRINTING MARKET DATA")
        print("PRICE " + str(self.price))
        print("CAP " + str(self.cap))
        print("24H VOLUME " + str(self.volume_24h))
        print("24H CHANGE " + str(self.change_24h))
        self.browser.close()
        return

bot = Bot("BTC")
bot.get_crypto_data()

Great, now we have our data as variables that we can pass around our Bot class. Let’s pass them to a write_file function, then save them to a local file on the Alibaba Cloud instance with a system timestamp.

def write_file(self):
    save_file = open(self.crypto + "_market.txt", "a")
    time_stamp = str(datetime.now())
    save_file.write("\n" + time_stamp + "\n")
    save_file.write("\nPRICE: " + str(self.price))
    save_file.write("\nCAP: " + str(self.cap))
    save_file.write("\n24H VOLUME: " + str(self.volume_24h))
    save_file.write("\n24H CHANGE: " + str(self.change_24h))
    save_file.close()  # flush the data to disk

In the above code we open a file named after the crypto variable concatenated onto _market.txt, in append mode (“a”) so we don’t overwrite our previous data. Next we make a time_stamp variable holding the current date and time as a string, then write the timestamp and the market data to save_file. The “\n” concatenated on is a line break so everything is not on one line in the file. Finally we close the file so the data is flushed to disk.
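If you prefer, the same method can be written with a context manager so the file is always closed, even if one of the writes fails. A minimal sketch of that variant:

def write_file(self):
    # "a" appends so previous readings are kept; the with block closes the file automatically.
    with open(self.crypto + "_market.txt", "a") as save_file:
        time_stamp = str(datetime.now())
        save_file.write("\n" + time_stamp + "\n")
        save_file.write("\nPRICE: " + str(self.price))
        save_file.write("\nCAP: " + str(self.cap))
        save_file.write("\n24H VOLUME: " + str(self.volume_24h))
        save_file.write("\n24H CHANGE: " + str(self.change_24h))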

Of course, we can scrape any crypto this way by changing the argument when we initialize the Bot class. Let’s scrape some Ethereum market data as well.

btc_bot = Bot("BTC")
eth_bot = Bot("ETH")
btc_bot.get_crypto_data()
btc_bot.write_file()
eth_bot.get_crypto_data()
eth_bot.write_file()

There we go: we have an automated way to grab a data set and return it as variables, and we can write it to a local file. This data can be plugged into an Alibaba Cloud machine learning service for data analysis, for instance, or into a relational database for tracking.
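For example, if you plan to load the readings into a database or an analytics service, appending a row of comma-separated values alongside the text file makes the data easier to import later. A rough sketch (the market.csv filename and the helper function are just illustrations, not part of the Bot class):

import csv
from datetime import datetime

def append_csv_row(bot, filename="market.csv"):
    # Append one row per reading: timestamp, symbol, then the scraped fields.
    with open(filename, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([datetime.now().isoformat(), bot.crypto,
                         bot.price, bot.cap, bot.volume_24h, bot.change_24h])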

Following is the final code

from selenium import webdriver
from datetime import datetime
import time

class Bot():
    def __init__(self, crypto):
        self.browser = webdriver.Firefox()
        ### CRYPTO TO SEARCH FOR
        self.crypto = crypto
        self.price = None
        self.cap = None
        self.volume_24h = None
        self.change_24h = None

    def get_crypto_data(self):
        self.browser.get("https://www.coinwatch.com/")
        ### FIND SEARCH FORM AND INPUT KEYWORD
        print("FINDING SEARCH FORM")
        elem = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/input")
        elem.clear()
        elem.click()
        elem.send_keys(self.crypto)
        elem = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/header/section[3]/div/div/div/div[1]/div/div/button")
        time.sleep(1)
        elem.click()
        ### GET MARKET DATA
        print("GETTING MARKET DATA")
        time.sleep(5)
        self.price = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[1]/div/div[3]/div/div/span").text
        self.cap = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[1]/span[2]").text
        self.volume_24h = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[2]/span[2]").text
        self.change_24h = self.browser.find_element_by_xpath("//html/body/div[1]/div/div/div/div/main/div/div/article/div[2]/div/main/div[1]/section/div[1]/div/div[2]/div/div[8]/span[2]").text
        print("PRINTING MARKET DATA")
        print("PRICE " + str(self.price))
        print("CAP " + str(self.cap))
        print("24H VOLUME " + str(self.volume_24h))
        print("24H CHANGE " + str(self.change_24h))
        self.browser.close()
        return

    def write_file(self):
        time_stamp = str(datetime.now())
        save_file = open(self.crypto + "_market.txt", "a")
        save_file.write(self.crypto)
        save_file.write("\n" + time_stamp + "\n")
        save_file.write("\nPRICE: " + str(self.price))
        save_file.write("\nCAP: " + str(self.cap))
        save_file.write("\n24H VOLUME: " + str(self.volume_24h))
        save_file.write("\n24H CHANGE: " + str(self.change_24h) + "\n")
        save_file.close()  # flush the data to disk

btc_bot = Bot("BTC")
eth_bot = Bot("ETH")
btc_bot.get_crypto_data()
btc_bot.write_file()
eth_bot.get_crypto_data()
eth_bot.write_file()
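Since each call to get_crypto_data opens and then closes its own browser session, one simple way to keep collecting readings over time is to re-create the Bot in a loop appended to the end of the script above. A rough sketch, with the five-minute interval chosen purely as an example:

import time

while True:
    for symbol in ["BTC", "ETH"]:
        bot = Bot(symbol)       # a fresh browser session per reading
        bot.get_crypto_data()
        bot.write_file()
    time.sleep(300)             # wait five minutes before the next round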

Original Source

https://www.alibabacloud.com/blog/data-collection-bot-with-python-and-selenium_594948?spm=a2c41.13076703.0.0
