Intelligent and Distributed Computing Laboratory
  Study of Data Stream Statistical Algorithms
Name: 聂国梁 (Nie Guoliang)
Defense date: 2006.05.09
Submission date: 2006.05.12
Degree: Doctoral
Chinese title: 流数据统计算法研究
English title: Study of Data Stream Statistical Algorithms
Supervisor 1: 卢正鼎 (Lu Zhengding)
Supervisor 2:
Keywords (Chinese, translated): data streams; statistical algorithms; approximation algorithms; real-time algorithms; synopsis structure; aggregation functions; hot elements; density estimation
Keywords (English): data stream; statistical algorithms; approximation algorithms; real-time algorithms; synopsis structure; aggregation functions; hot elements; density estimation
Chinese abstract (translated): Data streams are a data model that has emerged in recent years, appearing frequently and in varied forms in many application domains: IP packets in network monitoring, quotes in stock analysis, call records at telecom companies, signals sent by sensor networks, and so on. Compared with traditional data models, data streams have the following characteristics: they arrive in real time at variable rates; they arrive continuously and in an order independent of the application; their scale is enormous and their extremes cannot be predicted in advance; and once an element has been processed, it cannot be retrieved again unless deliberately stored. These characteristics mean that a data stream cannot be stored in full and admits only a single real-time pass. Compared with traditional database applications, stream applications typically must support continuous queries and approximate results. Traditional database techniques therefore cannot manage data streams effectively, and a growing number of researchers have turned to data stream problems.

Statistics over a data stream reflect its current state, play an important role in many decision-support systems, and form an important foundation for data stream mining. Because of the stream's peculiarities, traditional statistical algorithms do not apply, so data stream statistics has become a problem in urgent need of solutions.

Based on the sliding-window model, this thesis presents an optimized exponential histogram, the loose exponential histogram, together with its maintenance algorithm. Using logarithmic space, the scheme solves real-time approximate summation over data streams and shows clear time and space advantages over existing methods. Applying the loose exponential histogram to stream counting, the thesis takes the stream's similarity fully into account, defines a similarity function, designs a system framework, and gives an algorithm for counting the most recent N stream elements; the algorithm guarantees a relative error no greater than a specified threshold, and the greater the stream's similarity, the greater its time and space advantages. For maximum (minimum) statistics, an algorithm is designed that uses a chain structure to dynamically maintain the maximum over the current active window, with a compression policy for cases where memory use grows too large; with a small amount of space, the algorithm solves maximum statistics over sliding windows.

Compared with aggregate statistics, hot-element statistics describe the current state of a stream more accurately. Two one-pass algorithms, the trapezoid filtering algorithm and the wave sieving algorithm, solve hot-element statistics over sliding windows and guarantee that no qualifying hot element is missed. The trapezoid filtering algorithm uses exponential histograms to count element occurrences and compresses the histograms periodically to delete unneeded elements and counts; it is especially suited to unevenly distributed streams, where the candidate set stays stable in size even as the sliding window grows. The wave sieving algorithm counts occurrences by periodically creating and deleting sub-windows, each holding an independent synopsis structure; for every query it guarantees that not too many elements are output.

Compared with the two kinds of statistics above, density estimation reflects the current state of a stream, namely its distribution, in more detail. Based on kernel density estimation, an algorithm suited to stream characteristics is given. Using space far smaller than the data length, the algorithm partitions the stream into windows, retains a small amount of distribution information per window, and then combines the information of all windows to evaluate the stream's density distribution in real time.

All the statistics above reflect the current state of a stream but do not describe how that state changes, so a change-detection scheme is designed. The scheme describes change through elements whose occurrence counts vary greatly between two adjacent windows: first, the stream in a single window is divided into several levels, and the element domain is partitioned into segments at each level; then several segment sets are defined per level and summed; finally, by merging the synopses of the two windows and using set decomposition, the elements with large count changes are found to describe the change in the stream. With a certain probability, the scheme outputs the qualifying elements while requiring space far smaller than the stream size.
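The loose exponential histogram itself cannot be reconstructed from the abstract alone, but the standard exponential histogram it refines is well known. A minimal sketch of approximate counting over a sliding window with an exponential histogram (class and parameter names are illustrative, not the thesis's) could look like:

```python
from collections import deque

class ExponentialHistogram:
    """Approximate count of 1-bits among the most recent `window` stream
    items. Buckets have power-of-two sizes; when a size class overflows,
    the two oldest buckets of that size are merged. The error is at most
    half the oldest bucket's size."""

    def __init__(self, window, eps):
        self.window = window                      # sliding-window size N
        self.max_same = int(1 / eps) // 2 + 2     # buckets allowed per size class
        self.buckets = deque()                    # (timestamp, size), newest on the left
        self.time = 0

    def add(self, bit):
        self.time += 1
        # drop buckets whose newest element fell out of the window
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.time, 1))
        # cascade merges: whenever a size class overflows, merge its two oldest
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_same:
                break
            i, j = same[-2], same[-1]             # two oldest buckets of this size
            merged = (self.buckets[i][0], size * 2)  # keep the newer timestamp
            del self.buckets[j]
            self.buckets[i] = merged
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2   # correct for the oldest bucket
```

Because merging preserves the total, the only uncertainty is how much of the oldest bucket is still inside the window, which is what bounds the relative error.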
English abstract: In a growing number of information-processing applications in recent years, data takes the form of continuous data streams rather than traditional stored datasets. Application areas include network-traffic monitoring, computer-network security, data mining for e-commerce, sensor networks, financial monitoring, and many more. A data stream is modeled as an unbounded sequence of elements and differs from traditional data in four primary respects: (a) it arrives continuously; (b) its length is unknown or unbounded, so it is not feasible simply to load the arriving data into a traditional database management system (DBMS) or into main memory; (c) its arrival rate varies; and (d) it permits only one-pass access. These characteristics imply that operations on a data stream must support one-pass access, that statistical results are necessarily approximate, and that data stream management systems must support continuous queries. Because traditional DBMS techniques cannot manage data streams effectively, many researchers and institutes have been investigating this area. Statistics over a data stream reflect its current state and play an important role in decision support systems (DSS); they are also a foundation for data stream mining. Traditional statistics are not suitable for data streams, so statistics over data streams must be studied in their own right. This thesis studies several problems in data stream statistics, including aggregation functions, hot-element finding, density estimation, and change detection. First, it studies aggregation functions over sliding windows. The sliding-window model is useful for discounting stale data in stream applications: data elements arrive continually, and only the most recent N elements are used when answering queries.
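As a point of contrast with the approximate synopses studied in the thesis, the exact answer for the sliding-window model just described is simple but needs space linear in N, which is what the logarithmic-space structures improve on (names below are illustrative):

```python
from collections import deque

class SlidingWindowSum:
    """Exact sum of the most recent n elements: the O(n)-space baseline
    that approximate sliding-window synopses are measured against."""

    def __init__(self, n):
        self.n = n
        self.buf = deque()   # every live element must be kept
        self.total = 0

    def add(self, x):
        self.buf.append(x)
        self.total += x
        if len(self.buf) > self.n:      # expire the oldest element
            self.total -= self.buf.popleft()

    def query(self):
        return self.total
```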
Based on the sliding-window model, this thesis proposes a new data structure, the loose exponential histogram (LEH), together with an algorithm to maintain it. Using as little space as possible and exploiting the characteristics of the stream, the LEH answers approximate sum queries over data streams; theory and experiments show that it is effective. Counting is a special case of summing, yet it behaves differently. After studying the distribution of the stream, the thesis defines an evaluation function F and designs an LEH-based system framework for counting over a sliding window. Using logarithmic memory, the framework estimates the count over a window of size N with relative error between 0 and ε; theory and experiments show that the greater the value of F, the better the framework performs. For the max (min) function, the thesis uses a chain structure to maintain the maximum over the current sliding window and, to reduce memory cost, proposes a compression policy that effectively shortens the chain. Secondly, for another important problem in data stream statistics, hot-element finding, the thesis puts forward two one-pass algorithms in which no hot element is ever missed. The first employs exponential histograms to maintain element frequencies and runs a compression process periodically to delete needless elements; the actual space bounds obtained on experimental data are significantly better than both its worst-case guarantees and the observed space requirements of earlier algorithms, and it is well suited to skewed streams. The second constructs sub-windows and deletes expired sub-windows periodically, each sub-window maintaining a summary data structure; it outputs at most 1/ε + 1 elements per query.
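The abstract does not spell out the chain structure used for the max (min) function, but "a chain maintaining the maximum" plus "a compression policy" matches the classic monotonic-deque technique for sliding-window maxima, sketched here under that assumption:

```python
from collections import deque

def stream_window_max(stream, n):
    """Yield the maximum of the most recent n elements after each arrival.

    The chain holds (index, value) pairs with strictly decreasing values.
    Compression: an element that is both older and no larger than a new
    arrival can never again be the window maximum, so it is dropped."""
    chain = deque()
    for i, x in enumerate(stream):
        while chain and chain[-1][1] <= x:    # compress dominated elements
            chain.pop()
        chain.append((i, x))
        if chain[0][0] <= i - n:              # head of chain left the window
            chain.popleft()
        yield chain[0][1]                     # current window maximum
```

Each element is appended and removed at most once, so maintenance is amortized O(1) per arrival, and the chain length is bounded by the window size (much shorter on streams with rising runs).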
The thesis extends the second algorithm, designed for fixed-size sliding windows, to variable-size sliding windows, and evaluates both algorithms experimentally. Next, it investigates density estimation over data streams. Weighting recent data more heavily and adopting the kernel method, it proposes a density-estimation algorithm that divides the stream into sub-windows and stores a small amount of information for each; by integrating the sub-window information, it estimates the density of the stream in real time using as little memory as possible. Theory and experiments show that the algorithm is accurate and effective for density estimation over data streams. The subjects above reflect the current state of a stream but do not describe how its statistics change. Last, the thesis addresses change detection over data streams, which plays an important role in many stream-based decision support systems. It describes the change of a stream by detecting the elements whose frequency difference between two adjoining windows exceeds a threshold: the stream in a single window is divided into several levels, each of which partitions all elements into groups; supersets over the groups are defined, and the sum of each group is calculated; after combining the sketches of the two windows, the elements whose frequency change exceeds the threshold are found by binary search. Theory and experiments show that the algorithm is accurate and effective for detecting change over data streams.
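The per-sub-window summary is not specified in the abstract; a minimal sketch of the overall scheme, assuming uniform sampling as the summary and Gaussian kernels (both assumptions, not the thesis's design), might look like:

```python
import math
import random

def stream_kde(stream, sub_size=100, keep=5, bandwidth=0.5, seed=0):
    """Windowed kernel density estimation over a stream: close each
    sub-window by retaining `keep` random representatives, then estimate
    density by averaging Gaussian kernels over all retained points.
    Returns a callable density estimate."""
    rng = random.Random(seed)
    reps = []                       # representatives of closed sub-windows
    cur = []                        # the still-open sub-window
    for x in stream:
        cur.append(x)
        if len(cur) == sub_size:    # sub-window full: compress and close it
            reps.extend(rng.sample(cur, keep))
            cur = []
    pts = reps + cur                # summaries plus the open sub-window

    def density(q):
        if not pts:
            return 0.0
        z = math.sqrt(2 * math.pi) * bandwidth
        return sum(math.exp(-((q - p) ** 2) / (2 * bandwidth ** 2)) / z
                   for p in pts) / len(pts)

    return density
```

Only `keep` points survive each closed sub-window, so the space used is proportional to the number of sub-windows rather than to the stream length, matching the abstract's "a little information for each sub-window".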