By Zhu Xiaoran, nicknamed Yuheng at Alibaba.
The rapid development of the Internet, especially in China, promoted the evolution of many new types of media. For example, now there’s internet celebrities-or youtubers, instagramers, or more generally influencers, whatever you prefer to call them-and then their huge Internet-hooked audiences, who use their mobile phones to post their opinions, too. In China, the two biggest platforms of this kind are Weibo and WeChat Moments, but other websites are out there, too. For instance, many Chinese netizen will post their thoughts and reviews on review websites and the review section of e-commerce platforms like Taobao. So, in short, the modern face of new media with all of its platforms gives everyone a platform to be heard and express their opinions.
And, interestingly also, the broadcast speed of both breaking news and entertainment gossip is now far faster than ever before-and much faster than we can even imagine. Information is forwarded tens of thousands of times and read millions of times every few minutes. And, in line with this fast flow of data, large amounts of information may be transferred in a rather viral way. So, even now more than ever before, it is crucial for enterprises to monitor the public opinion about them and act accordingly when the Internet’s opinions strike.
And, landing at the intersection of big data business-where there’s also new retail and e-commerce-we can also see that, along with the rise of “Internet opinions,” things like order quantity and user comments have a huge impact on subsequent orders as well as the economy as a whole. For example, product designers need to be able to collect these sorts of statistics, analyzing the data to determine possible future directions in the development of their products.
And, at the same time, enterprise public relations and marketing departments also need to be equipped to handle public opinion promptly and properly. To answer all of this, and to implement this analysis, our traditional public opinion system has been upgraded to the big data-based public opinion collection and analysis system, which will be discussed in detail in this article.
The requirements of the big data public opinion system on the data storage and computing systems are as follows:
- Real-time Import of a Large Amount of Raw Data: Implementing a complete public opinion system requires collecting the upstream raw data via a crawler. The crawler system collects web content from various portals and self-media sources. The crawler system must deduplicate information before capturing and analyze and extract information after capturing, for example, capturing of sub web pages.
- Raw Web Data Processing: Whether it is information on a mainstream portal or a self-media web page, you need to extract it after capturing and convert the raw web content into structured data, such as the article title and abstract. It is critical to extract effective commodity comments as well.
- Public Opinion Analysis of Structured Data: After the raw output is structured, you need a real-time computing product to classify the output and tag the classified data according to the sentiment. The different outputs may generate based on the business requirements, for example, whether a brand has hot topics, the effect analysis of public opinion, broadcast path analysis, participant statistics collection and profiling, sentiment analysis of public opinion, or the emergence of any major warning.
- Storage, Interactive Analysis, Intermediate Query, and Final Data in the Public Opinion System: Many types of data are generated from the cleansing of raw web data to the final public opinion report. Some data will be provided to data analysts for the optimization of the public opinion analysis system, whereas some will be provided to business departments for decision-making based on public opinion analysis results. These queries may be flexible, requiring full-text retrieval, and multi-field interactive analysis capabilities of our storage system.
- Real-time Warning of Major Public Opinion Events: In addition to normal search and presentation of public opinion results, a real-time warning must be implemented when a major event occurs.
This article is a part of two-part article series that introduces the whole architecture of the public opinion analysis system. As the first article in this series, this article will provide the architecture design of the current mainstream big data computing architecture, analyze the advantages and disadvantages, and then introduce the big data architecture of the public opinion analysis system. The second article will describe a complete database table design and provide a sample code.
The following figure depicts the flowchart of the public opinion analysis system with a large amount of data. This is in many ways a simplification. It is based on the description of the public opinion system at the beginning of this article.
- The original web page repository must support a large amount of data at low costs and low latency. After being written to the storage library, web data needs to be structured and extracted in real-time. Then denoising, word segmentation, and Optical Character Recognition (OCR) must be performed for the extracted data. Sentiment recognition is conducted on segmented texts and images to generate a result set of public opinion data. A traditional offline full computation doesn’t meet the timeliness requirements of the public opinion system.
- When processing data, the computing engine may need to acquire some metadata from the repository, such as user information and sentiments metadata.
- In addition to the real-time computing link, regularly cluster inventory data to optimize the sentiments recognition library or the upstream triggers. Update the sentiments processing rules based on the business requirements and then computes the public opinion on the inventory data based on the new tagged sentiment database.
- Use result sets of public opinion for different requirements. There is a need for a real-time warning for major public opinion events. The presentation layer of complete public opinion results must support the full-text retrieval and flexible attributes combined query capabilities. Business analysis may be performed based on the confidence level, public opinion time, or keyword combinations of attribute fields.
The big data analytics system of public opinion requires two types of computing: real-time computing (including real-time extraction of massive web content, sentiment word analysis, and storage of public opinion results on web pages) and offline computing. The system needs to backtrack historical data, optimize the sentiment dictionary by manual tagging, and correct some real-time computing results. Therefore, you need to choose a system for both real-time computing and batch offline computing in terms of system design. In open-source big data solutions, the Lambda architecture meets these requirements.
The Lambda architecture is the most popular big data architecture in the Hadoop and Spark systems. The greatest advantage of this architecture is that it supports both batch computing, or offline processing, for a large amount of data and real-time stream computing, or hot data processing.
First, the upstream is generally a message queue service, such as Kafka, that stores written data in real-time. Kafka queues have two subscribers:
1. Full Data: It’s the upper half in the image. The full data stores on storage media such as the Hadoop Distributed File System (HDFS). Upon an offline computing task, computing resources (such as Hadoop) access the full data in the storage system for batch computing. After the map and reduce computations, the full data results are written into a structured storage engine, such as HBase, and provided for the business side to query.
2. Kafka Queues Subscriber: It is the stream computing engine that consumes data in queues in real-time. For example, Spark Streaming subscribes to Kafka data in real-time, and stream computing results are written into a structured data engine. SERVING LAYER, labeled as 3 in the preceding figure, indicates the structured storage engine for batch computing and stream computing results. Results are displayed and queried on this layer.
In this architecture, batch computing features support processing a large amount of data, and the association of other business metrics for computing based on the business requirements. Computational logic is easily flexible, which is the advantage of batch computing. According to the business requirements, the results are computed repeatedly without changing the results computed earlier multiple times with the same computational logic. The disadvantage of batch computing is that the computing period is long and the computing results are not generated in real time.
As big data computing evolves, real-time computing is inevitable. Lambda architecture implements real-time computing through real-time data streams. Compared with batch computing, real-time computing has an incremental data streams processing method, which determines that data is often the newly generated data, that is, hot data.
Due to hot data, stream computing meets the organizational low-latency computing requirements. For example, in the public opinion analysis system, you may want to compute public opinion information in minutes after capturing data on a web page. As such, the business side has sufficient time for public opinion feedback.
The following section describes how to implement a complete big data architecture of public opinion based on the Lambda architecture.
An Open-Source Public Opinion Big Data Solution
Figure 3 presents a flowchart that shows different storage and computing systems required for constructing the entire public opinion system, along with different requirements for data organization and query. Based on the open-source big data system and Lambda architecture, the entire system should be designed as follows:
- The upstream of the system is the distributed crawler engine, which captures the source content of a subscribed web page based on the capturing task. The crawler engine writes the captured web content to the Kafka queue in real-time. The data written to the Kafka queue enters the stream computing engine (such as Spark or Flink) in real-time based on the computing requirements described previously, and is persistently stored in HBase for full data storage. A full data storage of a web page allows deduplication during web page capturing and offline batch computing.
- Stream computing structures and extracts the raw web page by converting the unstructured web content into structured data and performing word segmentation. For example, stream computing extracts the title, author, and the short description of the web page and performs word segmentation for the body and abstract, or short description. The extraction and word segmentation results are written back to HBase. After the structured data extraction and word segmentation, the stream computing engine performs sentiment analysis on the web page based on the sentiment dictionary to determine if any public opinion has been generated.
- The public opinion results analyzed by the stream computing engine are stored in the MySQL or HBase database. To facilitate the searching and viewing of result sets, it synchronizes data with a search engine, such as Elasticsearch, for attributed-combined queries. For major public opinions, it writes the public opinion time into the Kafka queue to trigger public opinion alerts.
- The Spark system periodically computes all structured data offline. The sentiment dictionary is updated, or new computing policies are used to recompute historical data to correct real-time computing results.
An Open-Source Architecture Analysis
In the preceding public opinion big data architecture, Kafka connects to the stream computing engine, and HBase connects to the batch computing engine to implement batch views and real-time views in Lambda architecture. The architecture is clear and meets the online and offline computing requirements. However, it is not easy to use this system in production due to the following reasons:
- The architecture involves many storage and computing systems, including Kafka, HBase, Spark, Flink, and Elasticsearch. Data flows across different storage and computing systems. It is a great challenge to operate and maintain each open-source product in the architecture. Failure of any product or tunnel between products affects the timeliness of the overall analysis results.
- For batch computing and stream computing, the raw web page needs to be stored in Kafka and HBase, respectively. Offline computing consumes data in HBase, whereas stream computing consumes data in Kafka, which may cause redundancy of storage resources, maintenance of two sets of computational logic, and increased computing code development and maintenance costs.
- The public opinion computing results are stored in the MySQL or HBase database. To enrich the combination-based query statements, you need to synchronize the data to Elasticsearch. You may need to combine the query results in MySQL and Elasticsearch. The results are written into search systems such as Elasticsearch through databases because the real-time data writing capability and data reliability of search systems are not as accurate as databases. In the industry, databases and search systems are generally integrated, and these resulting integrated systems combine the advantages of both databases and search systems. However, data synchronization and cross-system querying between two engines increase the operations and maintenance and development costs.
A New Big Data Architecture: Lambda Plus
According to the previous analysis, you may wonder whether any simplified big data architecture meets Lambda’s computing requirements and reduce the number of storage and computing modules. Jay Kreps proposed the Kappa architecture.
In short, to simplify two storage systems, Kappa cancels the full data repository and keeps logs in Kafka for a longer period of time. When recomputing and backtracking are required, Kappa resubscribes to the data from the header of the queue and performs stream computing on all data stored in the Kafka queue again. The design solves the pain point between the two storage systems and the two sets of computational logic that needs to be maintained.
The disadvantage is that the historical data reserved in the queue is limited, and it is hard to trace data without time limits. Lambda is an improvement according to Kappa. Assume that a storage engine meets the requirements of efficient writing and random database query, and it’s also similar to the message queue service that meets the first-in-first-out (FIFO) requirements. In this case, you could consider combining the Lambda and Kappa architectures to create the Lambda Plus Architecture.
Based on Lambda, the improvements of the new architecture are as follows:
- In addition to supporting stream computing and batch computing, the computational logic is reused to implement one set of codes with two types of requirements.
- Full historical data and incremental online real-time data is unified and implements one storage system with two types of computing types.
- To facilitate the query of public opinion results, it stores batch view and real-time view data in a database that supports high-throughput real-time writing, multi-field retrieval, and full-text retrieval.
In summary, the core of the new architecture is solving storage problems and flexible connections to computing engines. The architecture of the solution is expected to be similar to the following.
- Write data streams into a distributed database in real time. Compute full data offline in the batch computing system using the database query capability.
- The database can read incremental data through the database log interface and connect to the stream computing engine for real-time computing.
- Write batch computing and stream computing results back to the distributed database. The distributed database provides rich query semantics for the interactive querying of computing results.
In the architecture, the storage layer replaces the message queue service in the big data architecture by combining master table data and logs in the database. For the computing system, choose the batch computing and stream computing engines, such as Flink or Spark. Tracing historical data without limits, as in the Lambda architecture, and process two types of computing tasks with one set of logic and storage as in the Kappa architecture. This architecture is called Lambda Plus. The following section describes how this type of big data architecture is built on Alibaba Cloud.
The Public Opinion System Architecture on Alibaba Cloud
Among the various storage and computing products of Alibaba Cloud, you can choose two products to implement the big data system for public opinion, based on the requirements of big data architecture. The storage layer uses the technologies and computing capabilities of Alibaba Cloud Tablestore, which is a distributed multi-model database, and the computing layer uses Blink to implement stream-batch integrated computing.
This architecture is based on Tablestore in terms of storage. A single database can meet different storage requirements. According to the introduction to the public opinion system, crawler data involves four stages in the system flow. Raw web content, structured data on the web page, analysis rule metadata and public opinion results, and retrieval of public opinion results.
The architecture uses the wide column model and schema-free feature of Tablestore to merge the raw web page and structured data of the web page into one web page. Web data tables connect to computing systems using the new Tablestore feature, Tunnel Service. Based on database logs, Tunnel Service organizes and stores data in the data writing order. This feature enables the database to support stream consumption in queues. As such, the storage engine has random access to databases and sequential access to queues, which meets the requirements for integrating the Lambda and Kappa architectures.
Analysis rule metadata tables consist of analysis rules, the sentiment dictionary group layer, and dimension tables in real-time computing.
It also uses Blink as the computing system. It is an Alibaba Cloud real-time computing product that supports both stream computing and batch computing. Similar to Tablestore, Blink easily implements distributed scale-out, allowing computing resources to elastically expand with business data growth. The advantages of using Tablestore and Blink are as follows:
- Tablestore is deeply integrated with Blink, supporting source tables, dimension tables, and destination tables. Hence, there is no need to develop code for data streams.
- In this architecture, the number of components is reduced to 2 from 6 or 7 while using open-source products. Manage Tablestore and Blink; both offer good scale-out performance. They automatically and easily scale out during business peaks, which greatly reduces the O&M costs of the big data architecture.
- The business side only needs to focus on some of the data processing logic, and the logic for interaction with Tablestore has been integrated into Blink.
- In the open-source solution, if you want to connect the database source to real-time computing, you need to double a queue to allow the stream computing engine to consume data in the queue. In our architecture, databases function as data tables and queue tunnels for real-time incremental data consumption, which greatly reduces the architecture development and use costs.
- Timeliness is essential in public opinion systems for stream-batch integrated computing. Therefore, you need a real-time computing engine. In addition to real-time computing, Blink also supports batch processing of Tablestore data.
In off-peak hours, you may often need to process data in batches and write the data to Tablestore as feedback, such as sentiment analysis feedback. Therefore, it is ideal that architecture supports both stream processing and batch processing. This implies using one architecture to set analysis code for both real-time stream computing and offline batch computing.
The architecture generates real-time public opinion computing results through the computing process. Connect Tablestore to the Function Compute trigger to activate alerts for major public opinion events. Tablestore and Function Compute implement synchronization for incremental data. Writing events into the result table helps using Function Compute to easily trigger SMS messages or email notifications.
The complete analysis results and display search of public opinion use the new feature-the serch index of Tablestore to completely solve the following pain points of the open-source HBase and Solr multi-engine:
- Complex O&M requires the ability to operate and maintain HBase and Solr systems and maintain the data synchronization link.
- The data consistency of Solr is not as good as HBase, and the data semantics of HBase is not the same as that of Solr. Besides, the data consistency of Solr and Elasticsearch is not as strict as that of databases. In some extreme cases, data inconsistency may occur, and it is hard for open-source solutions to achieve cross-system consistency.
- You need to maintain two sets of query APIs, requiring both HBase client and Solr client APIs. For the fields that are not in the index, you’ll need to check data in HBase, which offers poor usability.