Multi-Tenant I/O Workload Separation
Si Chen, Jianqiao Liu, Avani Wildani
Part of provisioning a storage system is understanding how to balance the unique needs of multiple clients with individual service level agreements (SLAs). Because most I/O traces are composed of interleaved workloads, it is difficult to isolate a single client’s activity, and SLAs are therefore hard to meet consistently. Moreover, a single client may have multiple, functionally distinct workloads that are not rigorously defined.
We introduce two methods for workload characterization and separation.
Identifying the characteristics of a storage workload is critical for resource provisioning for metrics including performance, reliability, and utilization. Although multi-tenant systems are increasingly commonplace, characterization of multiple workloads within a single system trace is difficult because workloads are highly dynamic and typically not labeled. We show that, by converting a block I/O workload to a signal and applying blind source separation, we are able to successfully separate many application workloads.
If a trace is translated into a signal by binning across the spatial dimension, the problem of separating workloads becomes analogous to separating any set of signals that share a noisy channel. We thus posit that interleaved workloads are amenable to blind source separation (BSS) techniques such as independent component analysis (ICA), which are traditionally used for signal separation.
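As a minimal sketch of the trace-to-signal step, the snippet below assumes that "binning across the spatial dimension" means quantizing each logical block address into a fixed-width address bin, so that the trace becomes a sequence of bin indices in access order. The function name `bin_lbas` and the `bin_width` parameter are hypothetical, and other binning schemes are equally plausible.

```python
def bin_lbas(lbas, bin_width):
    """Map each logical block address to the index of its address bin.

    Assumption: bins are contiguous, equal-width ranges of the address
    space, and the signal keeps the original access order.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return [lba // bin_width for lba in lbas]


# Hypothetical accesses: three near the start of the device, one at the
# end of the first 1024-block bin, two in a bin further along.
signal = bin_lbas([0, 7, 8, 1023, 4096, 4100], bin_width=1024)
print(signal)  # [0, 0, 0, 0, 4, 4]
```

A coarser `bin_width` smooths the signal at the cost of spatial detail, so in practice the width would need tuning per trace.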
Without a standard definition of a workload, we tentatively use the process ID (PID) as a workload identifier. To further simplify the problem, we remove accesses that do not correspond to one of the top 10 PIDs. In this way, we transform a single trace (with PID and the corresponding logical block address as its fields) into several sub-channels of workload signals. We then use a mixing matrix to simulate I/O contention and scheduling effects between the workloads. Finally, we measure the recovery accuracy of four BSS algorithms: FastICA, Algorithm for Multiple Unknown Signals Extraction (AMUSE), Joint Approximate Diagonalization of Eigenmatrices (JADE), and Second-Order Blind Identification (SOBI). To validate, we calculate the mean squared error (MSE) between the recovered signals and the true source signals.
Mixes of I/O workloads are separable using BSS techniques with accuracy above 90% on two datasets. Among the four BSS algorithms, JADE appears to be the best choice for workload separation, combining good accuracy with low memory consumption.
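The mix-then-recover evaluation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the three sources are synthetic sinusoids standing in for per-PID workload signals, the mixing matrix is arbitrary, and the recovery step is a textbook AMUSE implementation written directly in NumPy (whiten, then diagonalize a time-lagged covariance). Because BSS recovers sources only up to order, sign, and scale, the hypothetical helper `matched_mse` standardizes and aligns the signals before computing the MSE.

```python
import numpy as np


def amuse(X, lag=1):
    """Estimate sources from mixtures X (channels x samples) with AMUSE."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten using the eigendecomposition of the zero-lag covariance.
    eigvals, eigvecs = np.linalg.eigh(X @ X.T / X.shape[1])
    Z = (eigvecs / np.sqrt(eigvals)).T @ X
    # Diagonalize the symmetrized time-lagged covariance of the whitened data.
    C = Z[:, :-lag] @ Z[:, lag:].T / (Z.shape[1] - lag)
    _, V = np.linalg.eigh((C + C.T) / 2)
    return V.T @ Z


def matched_mse(S_true, S_est):
    """MSE after standardizing and resolving permutation and sign."""
    def standardize(S):
        S = S - S.mean(axis=1, keepdims=True)
        return S / S.std(axis=1, keepdims=True)

    A, B = standardize(S_true), standardize(S_est)
    corr = A @ B.T / A.shape[1]
    # Greedy match: each true source picks its most correlated estimate.
    best = np.abs(corr).argmax(axis=1)
    signs = np.sign(corr[np.arange(len(best)), best])
    return float(np.mean((A - signs[:, None] * B[best]) ** 2))


# Synthetic stand-ins for three workload signals (assumption: real
# per-PID signals are far noisier and less regular than these).
t = np.arange(700)
sources = np.vstack([np.sin(2 * np.pi * t / p) for p in (50, 20, 7)])

# Arbitrary full-rank mixing matrix simulating contention between workloads.
mixing = np.array([[1.0, 0.5, 0.3],
                   [0.4, 1.0, 0.6],
                   [0.2, 0.7, 1.0]])
mixed = mixing @ sources

recovered = amuse(mixed)
mse = matched_mse(sources, recovered)
print(f"MSE between true and recovered signals: {mse:.2e}")
```

AMUSE relies on the sources having distinct autocorrelations at the chosen lag, which holds for these sinusoids by construction; signals derived from real traces may not satisfy that condition, so the near-zero error here says nothing about accuracy on real workloads.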
Chasing the Signal: Statistically Separating Multi-Tenant I/O Workloads. Si Chen, Avani Wildani. In ML for Systems, Montreal, Canada. December 2018.
Understanding how many simultaneous tenants are interacting in a shared storage system is essential for SLA satisfaction and resource provisioning. However, due to the volatility of multi-tenant system behavior, existing approaches fail to distinguish interleaved storage workloads on shared systems. We introduce CENSUS, a novel classification framework that combines time series analysis with gradient boosting to identify the number of tenants in a storage workload by projecting its trace into a high-dimensional feature representation space. We show that CENSUS can distinguish the number of interleaved workloads in a real-world trace segment with an average error of 5–28%.
Predicting the exact number of tenants is a hard task. Although several studies have attempted to identify or disentangle interleaved workloads, none has successfully separated workloads without receiving the number of workloads out of band.
CENSUS includes two main components: time-series-based preprocessing and feature extraction to select workload attributes, and a gradient boosting classification model that gives the final prediction of the number of workloads. The input to CENSUS is a set of I/O trace segments of equal length. Features are extracted from the address values and time intervals of the I/O trace segments within a window. Labeled training data is then fed to a classification model, which predicts the number of workloads for unseen I/O traces. To limit the number of irrelevant features, we judge each feature's criticality, that is, the impact the feature has on the CENSUS classifier, and preserve the more critical half of the features. During training, we use the process ID (PID) associated with an individual I/O as a proxy for workload.
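The preprocessing described above can be sketched in a few lines. Everything here is an assumption for illustration: the record layout `(timestamp, pid, lba)`, the function names, the choice to drop a trailing partial segment, and the idea that criticality is available as a per-feature score. The gradient boosting classifier itself is omitted.

```python
def segment_trace(records, segment_len):
    """Split a trace into consecutive, non-overlapping, equal-length segments.

    Assumption: a trailing partial segment is discarded.
    """
    count = len(records) // segment_len
    return [records[i * segment_len:(i + 1) * segment_len] for i in range(count)]


def describe_segment(segment):
    """Return (addresses, time intervals, label) for one segment.

    The label is the number of distinct PIDs, used as a proxy for the
    number of workloads during training.
    """
    times = [t for t, _, _ in segment]
    addresses = [lba for _, _, lba in segment]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    label = len({pid for _, pid, _ in segment})
    return addresses, intervals, label


def keep_top_half(criticality):
    """Keep the names of the higher-scoring half of the features."""
    ranked = sorted(criticality, key=criticality.get, reverse=True)
    return ranked[:len(ranked) // 2]


# Hypothetical trace: (timestamp, pid, lba) tuples.
trace = [
    (0, 11, 100), (5, 12, 9000), (7, 11, 108),
    (15, 12, 9008), (16, 12, 9016), (20, 12, 9024),
    (21, 13, 50),
]
segments = segment_trace(trace, 3)
for seg in segments:
    print(describe_segment(seg))
# ([100, 9000, 108], [5, 2], 2)
# ([9008, 9016, 9024], [1, 4], 1)
```

The address and interval lists are what a feature extractor would consume, and the label is what the classifier would be trained to predict.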
We interpret the most critical features in storage-system terms: address complexity, address absolute sum of changes, address change quantiles, and time longest strike below mean.
CENSUS can maintain an average accuracy of 90% when the classifier is extended to predict a close approximation of the number of tenants, which suggests that CENSUS captures fundamental patterns in the I/O trace that reflect the number of workloads.
We also derive critical extracted features that are, crucially, highly dissimilar to those used in prior workload analyses, opening the field to insights derivable from formerly overlooked metrics.
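To make these feature names concrete, the sketch below gives one plausible definition of each. The definitions are assumptions based on the common meaning of these time-series feature names, not taken from the source, and the quantile rule in `change_quantiles` is a simple rank-based choice made for illustration.

```python
import math


def complexity(x):
    """Square root of the summed squared successive differences."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, x[1:])))


def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def change_quantiles(x, q_low, q_high):
    """Mean absolute change between consecutive values that both lie
    inside the corridor bounded by the q_low and q_high quantiles."""
    ordered = sorted(x)

    def quantile(q):
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    low, high = quantile(q_low), quantile(q_high)
    changes = [abs(b - a) for a, b in zip(x, x[1:])
               if low <= a <= high and low <= b <= high]
    return sum(changes) / len(changes) if changes else 0.0


def longest_strike_below_mean(x):
    """Length of the longest run of consecutive values below the mean."""
    mean = sum(x) / len(x)
    best = run = 0
    for value in x:
        run = run + 1 if value < mean else 0
        best = max(best, run)
    return best


# Hypothetical address sequence: two clusters of nearby addresses
# separated by long jumps.
addresses = [100, 104, 104, 4000, 4008, 90, 80, 70]
print(absolute_sum_of_changes(addresses))     # 7846
print(change_quantiles(addresses, 0.0, 0.5))  # 6.0
print(longest_strike_below_mean(addresses))   # 3
```

Applied to addresses, the first three features summarize how far and how erratically accesses jump across the address space; the last one would be applied to the time intervals, where a long run below the mean corresponds to a sustained burst of closely spaced accesses.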