Double 12 is nothing to say, Ali revealed for the first time the 100 billion-level feature distributed machine learning platform XPS

Double 12 is nothing to say, Ali revealed for the first time the 100 billion-level feature distributed machine learning platform XPS

Alibaba's e-commerce platform has hundreds of millions of users and products, and generates tens of billions of user feedback data every day. For example, guess you like the scene on the Taobao homepage, and there are 10 billion user behavior data every day. Such ultra-large-scale training data has brought huge challenges to distributed machine learning and also introduced interesting research questions.

In 2017, the Alibaba recommended algorithm team and the computing platform PAI team worked together to create the eXtreme Parameter Sever (XPS) machine learning platform, in which eXtreme means "pursuing the ultimate", reflecting our desire to design a machine learning platform with ultimate performance and effects . The XPS platform has been widely used in mobile Taobao s guess you like, Life Research Institute, Fliggy Travel and Tmall Recommendations and other big data scenarios.

On the day of the 2017 Double 11 shopping carnival, the hourly XNN model was launched on the Guess You Like and Tmall Recommendation Scene, stably and quickly using the real-time behavior information of users on Double 11, which significantly increased the revenue and user value on Double 11. In terms of performance, the routine algorithms on the XPS platform can now easily process the characteristics of 10 billion samples and 100 billion samples every day. The algorithm runs fast, has strong fault tolerance and high resource utilization. This article will introduce the overall structure of the XPS platform, and hope to share our experience in distributed algorithm design and framework optimization through these sharings.

In recent years, Alibaba's personalized recommendation and personalized search have brought a good experience to users, and the number of user behaviors has also increased substantially. In particular, the business of mobile terminals is developing rapidly. Both dimensions of users and commodities are showing explosive growth, and the status of users and products continues to change dynamically over time. Under this dynamic and ultra-large-scale data volume, it is very valuable and challenging to build an efficient distributed machine learning platform that accurately predicts the click-through rate and conversion rate of users on the product.

The challenges of large-scale and high-frequency-changing features and samples to the design of distributed machine learning platforms can be summarized in three aspects: samples, features, and distributed scale:

In terms of samples, we are dealing with tens of billions of training data every day, and the cumulative six months of historical training data exceeds the trillions of scale. Obviously, traditional full multi-round iteration machine learning algorithms are no longer suitable for training samples of this size, because this type of algorithm consumes a lot of computing resources and cannot well introduce data timing.

In terms of features, the total number of features in large-scale samples can easily exceed the order of hundreds of billions. The traditional feature numbering method requires a lot of machine resources and a long calculation time to complete the feature numbering, and it is difficult for new features Numbering in time. In terms of resource consumption and time consumption, the feature serialization numbering method is already an unbearable step. In addition, using a method similar to TensorFlow to map features to a fixed range through string_to_hash_bucket, although the shape of the tensor is guaranteed to be fixed and the total number of parameters is reduced, when the total number of features is huge, a large number of hash conflicts are introduced. Affected the algorithm effect.

In terms of distributed scale, the large-scale feature puts huge pressure on Server storage and distributed computing performance. For example, 1 trillion 32-bit float floating-point numbers require 3.63TB of storage space, plus historical gradients that need to be preserved, and often 300 to 600 servers are needed to control the memory usage of each process within a reasonable range. . The number of servers has doubled, resulting in a linear increase in the number of parallel requests, which also brings greater pressure on communication. At the same time, the increase in storage capacity and the number of single-task processes has also put greater pressure on cluster scheduling, fault tolerance, network, and IO.

In the face of these challenges, the XPS platform has proposed many innovative technologies to cope with it, and has taken a step towards the goal of "extreme parameter server":

On the problem of sample processing, we use streaming learning algorithm-based algorithm selection to solve the problem of large-scale samples. Under streaming learning, for each batch of new data, incremental training is directly performed on the current model, and the next model is produced, without loading the full amount of data for multiple rounds of full learning. The selection of streaming learning algorithm balances the problem of data scale and resource consumption, and responds to the problem of large-scale samples more lightly;

In terms of feature processing, a method of mapping feature hashes to hash values is used to replace the feature numbering mechanism. While saving memory and improving performance, it supports the dynamic sparse regularization mechanism of features and the dynamic expansion mechanism that represents the vector dimension, which solves the problem of excessively large total features;

In terms of distributed scale, through mechanisms such as asynchronous Checkpoint and Exactly Once Failover and high-performance ArrayHashMap, coupled with feature processing technologies such as dynamic sparse regularization mechanism, the performance of distributed training is guaranteed and the storage efficiency of Server is improved.

Faced with these large-scale machine learning problems, eXtreme Parameter Server came into being within Alibaba, which specifically solved the challenges of large-scale samples and large-scale features, and has been widely used.

XPS is widely used in Alibaba s internal business scenarios such as Guess You Like, Tmall, Shopping Link, Fliggy, Life Research Institute, Alimama, etc., and has significant effects on user click-through rates, online revenue enhancement, and online user value enhancement.

Below we introduce the system structure and data flow, distributed optimization, core algorithm and operator system of the XPS platform.

1. System structure and data flow

1.1 System structure

The overall structure of the XPS platform is shown in the figure below. In terms of data sources, the bottom layer supports data sources such as OSS File, MaxCompute offline storage, Streaming DataHub and Kafka. Users complete offline data processing and XPS algorithm call at the same time on MaxCompute. Cluster scheduling uses Ali Group s Feitian cluster scheduling, which can effectively avoid resource preemption and efficiently use computing resources; at the algorithm level, XPS provides algorithms such as XNN/XFTRL/XSVD/XGBOOST/FM; at the business level, we support recommendations, Business scenarios such as advertising and search.

The business side uses the SQL in MaxCompute to call the algorithm of the XPS platform, configure and select the required algorithm and optimization operator, and can quickly complete the construction of the training task. The estimation service will grab the model produced by XPS and provide it for online estimation service. XPS provides a perfect fault-tolerant mechanism, the user will be automatically re-trained if the task fails.

1.2 Data flow

At present, the XPS platform has established a complete data flow solution within Alibaba. We have cooperated with various teams in Alibaba Group to form a data flow from training data production, feature engineering, model training, model evaluation, model deployment, and model scoring. The overall data processing flow of XPS is shown in the figure below.

In the data production, feature engineering, model training and model evaluation phases, we adopted Ali's MaxCompute offline storage system. A single task on the platform can easily handle tens of billions of training data per day, and feature learning of hundreds of billions of features. By supporting large-scale features, the data laws contained in the data can be fully mined. Model training uses streaming scheduling. Under streaming scheduling learning, each time you learn, you only need to load the previously stored model, input new samples for training to produce a new model, and deploy it online for estimation. On the whole, streaming learning saves a lot of computing resources compared to full computing.

On the estimation server, we use Ali s internal rtp-xps service, we convert the model into tf format, and use tf-serving for the estimation service. In terms of model evaluation, we have developed our own conformance testing solutions and tools based on rtp-xps and MaxCompute. Combining tf-serving can achieve rapid feature and model conformance testing. In feature engineering, during the development of XPS, we developed a set of high-performance SessionServer services. SessionServer extracts the user s previous behavior data for a period of time. These data can better help us understand the user and also capture the user s real-time Behavioral characteristics.

2. Distributed optimization

In order to be able to support hundreds of billions of features and trillions of samples, we have made special optimizations for asynchronous communication in distributed machine learning scenarios. The XPS framework has independently developed a high-performance communication framework. In the asynchronous communication scenario under the distributed machine learning scenario, the performance of the traditional MPI communication is improved by 1 to 2 times. Through these optimization methods, the number of XPS Servers can be expanded horizontally to 800, completing machine learning tasks with a scale of hundreds of billions of features and trillions of parameters. The distributed optimization technology specifically includes the following aspects:

2.1 Feature hashing

The XPS framework supports directly inputting the hashed feature ID, or automatically calculating the hash value for the input string feature. In this way, only the hash value is stored in the entire calculation process and the output model. We connect with the online inference service. Input samples, output models, training iterations, and communication can directly hash and hash feature IDs. In actual online model training, half of the memory is saved, and the performance is also doubled.

2.2 Dynamic feature scaling

In the streaming learning scene, a notable feature is the fast-changing dynamic characteristics. New features are added all the time, and old features are deleted. After the introduction of real-time automated conversation features, hundreds of billions of features will be inserted and deleted in a relatively short period of time. In this case, the underlying framework is required to support high-frequency, dynamic feature additions and deletions and communication. The XPS framework has made special optimizations for high-frequency addition and deletion feature scenarios. The XPS-ArrayHashMap is redesigned and implemented, and the realloc/mremap method is used to manually manage memory. The performance of inserting and deleting KV is significantly higher than std::unordered_map, google::DenseMap, etc. Hash table implementation; more importantly, XPS-ArrayHashMap supports direct zero-copy communication, eliminating serialization and deserialization steps.

2.3 Global Checkpoint and Exactly Once Failover

The total storage space of trillion parameters can reach the level of 10TB. With storage space requirements of this magnitude, it is a common requirement to use 400-800 servers. And a large number of server processes and worker processes bring high stability and scalability pressure to the distributed parameter server. In response to this scenario, the XPS framework supports multiple Worker parallel loading and output models, asynchronous Checkpoint and asynchronous Failover mechanisms, and can automatically recover with zero perception of node failure.

Different from an independent HPC cluster, the Feitian cluster of Alibaba Group has a relatively high probability of encountering individual node failures when the number of parallel nodes is large. The XPS framework supports a complete Failover function. It supports Exactly Once Failover on all kinds of streaming data sources and offline data sources, which can restore the data flow and model to the moment before the node fails, so that the node will not be restarted. Loss of data and no double counting.

2.4 High concurrent communication

The XPS framework has independently developed a high-performance communication framework. In the asynchronous communication scenario under the distributed machine learning scenario, the performance of the traditional MPI communication is improved by 1 to 2 times. For large-scale feature scenarios, the communication layer of XPS can support zero-copy transmission and reception of the sparse matrix, and combine communications through the sparse layer and the dense layer respectively, reducing the number of communication packets and reducing the communication delay. Through these optimization methods, the number of XPS servers can be expanded horizontally to 800, which can support the storage of hundreds of billions of features and trillions of parameters during the training process.

2.5 Represents learning optimization

Learning the representation vector of sparse features is the most important for the calculation of sparse features and communication optimization. We have deeply optimized the calculation of Embedding matrix under sparse hash features to optimize the performance of representation vector learning. Specifically, although the representation vector group obtained through the Pull communication operation represents a sparse matrix, all the representation vectors are in a continuous buffer. This communication buffer data can be used to directly construct an ArrayHashMap without memory copy. Through the sparse feature index interface implemented by such a hash table, the representation vector of each sparse feature passed in communication can be accessed with O(1) time complexity during calculation. In this way, the sparse representation vector group can be converted into a dense format without being converted into a dense format. It is used efficiently, eliminating the overhead of creating an intermediate dense matrix, and greatly speeding up the embedding matrix calculation of minibatch.

3. the core algorithm

Large-scale training samples, full multi-round iterative algorithms, even if they are optimized more efficiently, it is impossible to avoid the problem of retraining full data every day. Obviously, this is no longer suitable for the development of big data. The selection of streaming learning algorithms can balance the problem of data scale and resource consumption, and easily cope with the learning problem of large-scale samples. The distributed machine learning platform with streaming learning algorithm as the core has many interesting optimization content in terms of algorithm parameter adjustment, algorithm evaluation, and algorithm monitoring.

XPS aims to design an efficient streaming learning algorithm platform in super-large-scale data scenarios. We currently mainly design linear algorithm XFTRL, bilinear algorithm XSVD and deep learning algorithm XNN.

3.1 XFTRL algorithm

The XFTRL algorithm is an extension of the classic linear algorithm FTRL, and is proposed to solve some of the shortcomings of FTRL under large-scale data. When we used the FTRL algorithm, we encountered many numerical problems and stability problems. In order to solve these problems, we designed several optimization points:

1. introduce regularity to prevent singular weights. In practice, we found that the FTRL algorithm is prone to sudden increase in weight during operation. By introducing the second regularity of the variable z in FTRL, the phenomenon of sudden excessive weight can be effectively suppressed;

2. the introduction of weighted version control and gradient averaging makes the update smoother. When updating the gradient, we introduced a discount mechanism for the difference between the pull weight and the weight version of the Push gradient to alleviate the problem of inconsistency between the weights of the worker and the server when the gradient is updated in a distributed manner. We further averaged the gradient sum on the server side under minibatch to make the model weight update process smoother;

3. introduce the parameter decay mechanism under streaming learning. We multiply the w, z, and n variables in the FTRL algorithm by an attenuation coefficient after each update, so that the weight of the entire model is biased toward the latest data, so as to better capture the timeliness of the data.

3.2 XSVD algorithm

The XSVD algorithm is an algorithm proposed by improving SVD++ for the three core elements of "user", "product" and "historical behavior product" in the e-commerce system. SVD++ is a classic algorithm in the recommendation field, but we rarely see solutions in large-scale data scenarios. Our core motivation is to solve the learnability problem of the SVD++ algorithm under trillions of samples. Under the keynote of the algorithm design of streaming learning, we need to transform the SVD algorithm into a model that can be learned in streaming scenarios. It is easy to imagine that for the implicit vector accumulation item of feedback products in SVD++, only the first K behaviors of the current behavior are taken as feedback items, and then streaming learning can be realized. Secondly, in order to make the XSVD algorithm also have the expressive power of the LR algorithm, we also take the user, product and conversation features as common features, and do joint learning with the hidden vectors. In addition, in order to enrich the expressive power of XSVD, we also introduced Ali s internal SLRM algorithm's "model characterization" transformation idea in XSVD, by mapping the product pair-wise relationship that needs to be learned in the SLIM algorithm to the feature of learning product pair-wise The weighting method introduces the SLIM idea in XSVD.

3.3 XNN algorithm

The XNN algorithm is our deep learning algorithm, and its structure is shown in the figure below. XNN mainly includes input layer (InputLayer), transformation layer (TransformLayer), product activation layer (MultiActiveLayer) and output layer (OutputLayer).

The InputLayer of the XNN network processes the discrete, combined and continuous features of the input. Each one-hot coded feature will be short and long coded, and then accumulate (reduce_sum) according to the feature group. TransformLayer performs various normalization changes on the input layer InputLayer and then pushes it to MultiActiveLayer. MultiActiveLayer performs layer-by-layer matrix multiplication and activation operations, and finally the top layer is output after the Sigmoid activation operator acts. The input data of XNN is organized according to feature groups to reduce the cost of cache miss when doing reduce_sum. Matrix operations use a mixture of Eigen matrix library and CBlas matrix library to balance the simplicity of the data interface and the efficiency of calculation.

Compared with classic deep learning algorithms, it has the following advantages:

Dynamic feature sparse regularization. We implement dynamic addition and dynamic deletion of features according to the dynamic utility value Utlity(f) of each feature f. When Utility(f) is greater than a certain threshold, we create features and learn feature weights. When it is less than a certain threshold, we erase features and delete feature weights. . The design idea of dynamic features effectively controls the total amount of learnable features; the
dynamic expression dimension expands. According to the dynamic information volume Infomation(f) of each feature f, we assign different hidden vector dimensions to different features, compress important features with high-dimensional compression, and compress unimportant features with low-dimensional compression, which improves feature value and memory usage efficiency. .
Automated session feature modeling. We automatically count long-term, mid-term and short-term conversational features in the model, and add conversational features to the neural network for joint learning of feature statistical values and implicit expressions. The idea of automated conversational features will be introduced in detail in the operator design.
Regular frequency division. We use different regular coefficients for features of different frequencies to avoid local over-fitting and maximize test accuracy. In terms of performance, I guess you like the full amount of data accumulated in the scene, including 100 billion-scale features and trillion-scale samples. A single sample has an average of 100 features. Under the resource overhead of 5600 cores and 16TB memory (mixed with MR tasks) Common CPU cluster), the XNN algorithm can complete the learning of tens of billions of streaming samples with daily increments in 7 hours.

XFTRL, XSVD and XNN algorithms are new and unique algorithms on the XPS platform. The XPS platform also supports the following classic algorithms: XGBOOST, Factorization Machine, OWL-QN, Word2Vector, etc. In the future, we will continue to expand the set of XPS algorithms and propose more innovative algorithms. We will further study more engineering features under streaming learning, such as decentralized design, intelligent processing of slow machines and optimization of communication layer mechanism, and establish a more complete streaming learning machine learning platform mechanism.

4. the operator system

In the process of developing XPS, we abstracted some general algorithm ideas and gradually formed the XPS operator system. When using the algorithm users of the XPS platform to design a new algorithm, after designing the algorithm, they only need to select the operators in the system to match, and then the algorithm development can be completed. The design of the operator system not only accelerates the algorithm construction process, improves the efficiency of algorithm development and debugging, but also enhances the expression ability of the algorithm. We mainly designed the following operators:

4.1 Streaming evaluation operator

For the streaming training data of XPS, we designed a streaming evaluation framework, which greatly accelerated our parameter adjustment work and effectively helped us to verify the correctness of the model. The specific idea of the streaming test operator is also very simple. After M batch training, we will evaluate the data of N batches in the future and observe the AUC, MAE, RMSE, maximum weight and sum of the estimated data. Minimum weight and other indicators, while paying close attention to the estimated accuracy PCOPC (Predict Click Over Post Click). Such an evaluation mechanism greatly improves the efficiency of parameter debugging. At the same time, at the end of the training, these indicators also reflect the quality of the model. The training framework will check these parameters, and only when they meet a certain correctness standard will the model be output to the predictive service module for model deployment.

4.2 Automated Session Feature Operator

In the click-through rate estimation and conversion rate estimation scenarios, the importance of the user's session characteristics in the recent period of time is very high. In a general system, the SessionServer used to provide session features can only provide statistics in a certain dimension. We put forward the design idea of "Feature Modeling" of Automatic Session. Specifically, the model performs data training while counting the exposure and clicks of each feature within a certain period of time, and then calculates the click rate of each feature in different cycles, and adds it to the training features and other features to train together. The addition of Automatic Session has greatly enriched our feature system. We provide three attenuation coefficient items, long, medium and short. Users can configure these three attenuation coefficients in any XPS algorithm to introduce automatic conversation feature operators to enhance the expressive ability of the model itself.

4.3 Gradient average operator

The learning of features with too low frequency can easily lead to excessive dispersion of model weights and the introduction of overfitting. Features with too high frequency are also prone to overfitting due to local over-update. Whenever the server side updates the weight, it will discount the gradient in different ways according to the feature frequency.

4.4 Asynchronous update control operator

The asynchronous update control operator performs gradient discounts based on the version difference between the Pull model weight and the model weight during Push gradient, which prevents the efficiency loss of asynchronous update. A gradient update value with a low version difference has a higher confidence level, and a gradient update value with a high version difference has a lower confidence level.

In addition to the above operators, there are activation function selection operators, regular selection operators, variable decay operators, and safety check operators in XPS.

For XPS algorithm development users, on the basis of providing efficient distributed scalability, we have also abstracted a set of SDK for algorithm developers, shielding users from complex distributed scheduling, communication and synchronization, asynchrony and other low-level details . Algorithm development users only need to consider the overall flow of the algorithm, the logic of calculating the gradient on the Worker, and the logic of updating the model on the Server, and select the operator in the embodiment of the algorithm operator to develop an algorithm with hundreds of billions of characteristics. The construction process of a new XPS algorithm generally includes the following 9 steps:

Complete the gradient calculation code on the Worker side of the new algorithm;
choose to add automated session features to enrich the feature system;
add a streaming test operator to facilitate quick parameter adjustment and model monitoring;
select the gradient average function to maintain the smoothness of the
update ; select the asynchronous update control algorithm The sub-version difference controls the function type; the
activation function is selected by the activation function selection operator; the
regular method is selected by the regular selection operator; the
variable attenuation value is selected, the variable attenuation operator is activated, and the variable flow attenuation is
selected ; security is selected The check operator performs safety check and safety control on the gradient value, the updated value, and the maximum and minimum values of the model.

Through the abstract design of XPS operators, algorithm development users have more flexible choices for algorithm optimization operators. In this way, users can concentrate on researching innovative algorithms without ignoring technical details that they don't need to care about. In Alibaba Group, the XPS platform has helped users develop new SLRM algorithms, and other new algorithms are also under study.

After a lot of practice, the eXtreme Parameter Server platform has become a new generation of distributed machine learning system for Alibaba Group to solve the problem of super-large-scale sample and feature learning. In 2017, XPS was widely used in Alibaba Group's search, advertising, and recommendation scenarios, and the number of algorithm development users was also growing rapidly. In 2018, we will build image and NLP algorithms in XPS; we are also improving the TensorFlow programming model, and will support users to use python programming to access TensorFlow, so that users can quickly write their own models and call the efficient XPS Sever function ; We are also integrating the reinforcement learning algorithm framework and introducing AliBasicFeatureServer, the basic feature service system of Ali.

The fast-growing Ali Group will encounter more big data research problems in the future. These problems are challenging and fascinating. XPS will move forward firmly under the design concept of extreme parameter server. Fan Chaosheng, head of XPS algorithm, and Chen Xu, head of engineering, said: The goal of eXtreme Parameter Sever is to design a distributed parameter server that pursues the ultimate performance and effect. We have taken a step forward. There is still a long way to go in the future. We will work hard to use it. AI technology changes our lives.

This article provides a preliminary introduction to the XPS platform. For more details, please pay attention to the papers published by the XPS team. Taobao Technology Official Account will also be released as soon as possible, so stay tuned!

Long press to identify the QR code, pay attention to Taobao technology