Headless Web Scraping in Python with Beautiful Soup 4

Setting Up the CentOS Instance on Alibaba Cloud

Installing Python 3, PIP3 and Nano from the Terminal Command Line

sudo yum update
sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
sudo yum install python36u-pip
sudo yum install nano

Install our Python Packages through Pip

pip36u install requests
pip36u install beautifulsoup4
mkdir Python_apps
cd Python_apps

Writing Our Headless Scraping Bot in Python

############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? ")
####### REQUEST GET METHOD for URL
r = requests.get("http://" + url)
####### DATA FROM REQUESTS.GET
data = r.text
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
print(source)
print(pretty_source)
####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.strip("https://" + "http://") + "_scrapped.txt" , "w")
####### WRITE THE VAR PRETTY_SOUP TO FILE
local_file.write(pretty_source)
### GET RID OF ENCODING ISSUES ##########################################
#local_file.write(pretty_source.encode('utf-8'))
####### CLOSE FILE
local_file.close()
python3.6  bot.py
print("*" * 30 )
print("""
#
# SCRIPT TO SCRAPE AND PARSE DATA FROM
# A USER INPUTTED URL. THEN SAVE THE PARSED
# DATA TO THE LOCAL HARD DRIVE.
""")
print("*" * 30 )
############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("ENTER URL TO SCRAPE")
####### REQUEST GET METHOD for URL
r = requests.get(url)
####### DATA FROM REQUESTS.GET
data = r.text
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
print(source)print(pretty_source)####### OPEN SOURCE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
####### MAKE A NEW FILE
local_file = open(url.strip("https://" + "http://") + "_scrapped.txt" , "w")
####### WRITE THE VAR PRETTY_SOUP TO FILE
local_file.write(pretty_source)
#local_file.write(pretty_source.decode('utf-8','ignore'))
#local_file.write(pretty_source.encode('utf-8')
####### CLOSE FILE
local_file.close()

Summary

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store