Improved K-means Algorithm for Manufacturing Process Anomaly Detection and Recognition


  • Journal: Journal of Wuhan University of Technology
  • File size: 313 KB
  • Authors: ZHOU Xiaomin, PENG Wei, SHI Haibo
  • Affiliation: Shenyang Institute of Automation, Chinese Academy of Sciences; Graduate School, Chinese Academy of Sciences
  • Updated: 2020-11-11
  • Downloads:
Paper Overview

Improved K-means Algorithm for Manufacturing Process Anomaly Detection and Recognition

ZHOU Xiaomin 1,2, PENG Wei 1, SHI Haibo 1
(1. Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China, E-mail: xmzhou@sia.cn; 2. Graduate School, Chinese Academy of Sciences, Beijing 100039, China)

Abstract: Anomaly detection and recognition are of prime importance in process industries. Faults are usually rare, and, therefore, predicting them is difficult. In this paper, a new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Based on the clustering result, generated by the improved K-means algorithm, of historical disqualified-product data from the manufacturing process, a prediction model is constructed to detect and recognize abnormal trends in quality problems. This simple and robust alarm-system architecture for predicting incoming faults realizes the transition of quality management from diagnosis afterward to prevention beforehand. Finally, the alarm model was applied to the prediction and avoidance of gear-wheel assembly faults at a gear plant.

Key words: data mining; clustering; quality management; anomaly detection and recognition

1 Introduction

For an enterprise, quality is the life of its products and services, so implementing the precaution principle is the core of modern quality management. The current lagging fault-diagnosis methods are of little use for real-time manufacturing-process quality control. By the time a product disqualification is detected, the loss is irretrievable, and this greatly affects the quality and efficiency of production.
So how to recognize early failure symptoms and performance-falling trends, and then take corresponding fault-elimination action beforehand by monitoring the process information and product characteristic information, becomes more and more important. This manner of forecasting manufacturing-process excursions is becoming a key step in avoiding losses and building a good product reputation.

The gearbox is an important component of the vehicle drivetrain system, and almost 60% of gearbox faults are caused by gear wheels, so it is very important to monitor the gear-wheel assembly process and detect abnormal trends on the gearbox assembly line. Based on a mass of gearbox performance-testing data from a gear plant, this paper first analyzes the gear-wheel assembly process with clustering technology, then explains the clustering result with domain experts' help and forms an anomaly-analysis decision table. To find hidden quality problems and provide abundant decision-making information for enterprise quality control, the status of each work station is monitored with the help of the anomaly-forecasting model.

2 An Improved K-means Clustering Algorithm

Data clustering is an important technique for exploratory data analysis. Clustering techniques combine observed objects into clusters satisfying the criteria that each cluster is homogeneous and different from the other clusters. K-means is a traditional, simple, and effective clustering method in common use; however, it has several problems: a) K-means requires the number of clusters to be specified beforehand, but determining the number of clusters is not easy. b) K-means requires one centroid for each cluster, and these centroids should be placed carefully, because different initial centroids lead to different results. c) The amount of abnormal data may be very large, which may distort the estimation of the data distribution, so the algorithm is very sensitive to abnormal data.
To overcome these defects as much as possible, this paper proposes an improved K-means clustering algorithm.

2.1 Basic Idea of the Improved Algorithm

We give several definitions before describing the idea of the improved algorithm:

Definition 1 (ε-neighborhood of a point): The ε-neighborhood of a point p, denoted by N(p), is defined by N(p) = {q ∈ D | dis(p, q) ≤ ε}.

Definition 2 (sparse point): If the number of points in the ε-neighborhood of a point is less than a given threshold value MinPts, then this point is called a sparse point; otherwise it is called a non-sparse point.

Definition 3 (cluster merging rule): Let the cluster center of cluster Ci be Oi, the cluster center of cluster Cj be Oj (j ≠ i), and the center of all the points in Ci ∪ Cj be Ok. Then clusters Ci and Cj satisfy the cluster merging rule if

∑p∈Ci |p − Oi| + ∑q∈Cj |q − Oj| ≥ λ · ∑p∈Ci∪Cj |p − Ok|   (λ ∈ (0.75, 1.5)).

Otherwise, they do not satisfy the merging rule.

Definition 4 (cluster radius): The maximum distance from the points in a cluster to the cluster center.

Definition 5 (subjection degree of a point to a cluster): Suppose the center of cluster Ci is Oi and the cluster radius is R. The subjection degree of a random sample point X to cluster Ci is:

Sul(X, Ci) = exp(−(r − R)² / (2σ²))   (1)

In this formula r = |X − Oi|; the smaller the σ-value is, the steeper the Gauss function is. In general, the σ-value is between 0 and 0.5 (σ ∈ (0, 0.5)).

Definition 6 (improved maximum-likelihood classification): Let the two biggest subjection degrees of sample point X be Sul(X, Ca) = a and Sul(X, Cb) = b. X belongs to cluster Ca if a and b satisfy the condition a ≥ (1 + μ)b; otherwise, X belongs to a new cluster.
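Definitions 5 and 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of the clusters, and the defaults σ = 0.3 and μ = 0.5 are illustrative assumptions, and a point inside the cluster radius is taken to have subjection degree 1.

```python
import math

def subjection_degree(x, center, radius, sigma=0.3):
    """Definition 5: Gaussian subjection degree of point x to a cluster.
    Points within the cluster radius are taken to have degree 1."""
    r = math.dist(x, center)
    if r <= radius:
        return 1.0
    return math.exp(-((r - radius) ** 2) / (2 * sigma ** 2))

def improved_mlc(x, clusters, mu=0.5):
    """Definition 6: x joins the cluster with the biggest subjection degree
    only if that degree dominates the runner-up by a factor (1 + mu);
    otherwise x is taken to start a new cluster (returned as None)."""
    degrees = sorted(((subjection_degree(x, c["center"], c["radius"]), name)
                      for name, c in clusters.items()), reverse=True)
    (a, best_name), (b, _) = degrees[0], degrees[1]
    return best_name if a >= (1 + mu) * b else None
```

For instance, with clusters of radius 1 centered at (0, 0) and (10, 0), the point (1.5, 0) is still absorbed by the first cluster, since its subjection degree there overwhelms its degree toward the second.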
The value of μ is chosen according to the strictness of the classification judgement; they are in direct ratio (commonly μ ∈ [0, 1]).

(1) Because the initial cluster centroids are crucial to the clustering result of this algorithm, reasonable initial centroids make the result more reasonable and the convergence faster. So the better choice is to place the initial centroids as far away from each other as possible. To realize this, we can adopt the following strategy: choose the non-sparse point with the biggest number of points in its ε-neighborhood as the first cluster centroid, and remove both this point and its ε-neighborhood points from the initial dataset. Then take the remaining data points as the initial dataset and choose the second cluster centroid in the same way. According to this rule, choose the third cluster centroid, the fourth, and so on.

The next step is to assign each remaining point to the cluster represented by the nearest cluster centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we merge clusters according to the cluster merging rule as much as possible. Then we re-calculate the new centroids as barycenters of the clusters resulting from the previous step. These steps are repeated until the clusters no longer change.

(2) For a cluster-unknown sparse sample point, it is usually difficult to identify whether it belongs to a new cluster or to an already existing cluster, and the traditional maximum-likelihood classification (MLC) is likely to assign a point that belongs to a new cluster to an existing one. Research on human classification mechanisms found that if the similarity degree of a to-be-classified sample point to one class is overwhelmingly bigger than to the other classes, then this sample point can be partitioned into that class.
Enlightened by this, we improved the MLC rule to avoid improper partitions and proposed the improved-MLC rule.

(3) Because the amount of abnormal sparse points may be very large and may affect the estimation of the data distribution, we separate these sparse points first, then assign the remaining data points to the right clusters. After all these steps have been finished, the sparse points are processed according to the improved MLC to identify whether each should be assigned to a new cluster or to an already known cluster.

2.2 Steps of the Improved Algorithm

In the following, we present the basic steps of the algorithm based on the previous analysis:

Input: parameters ε and MinPts, and a database of n data points
Output: K clusters satisfying the lowest-variance criterion

(1) Analyze the ε-neighborhood of each point in the given dataset, and separate the sparse data points from the non-sparse data points.
(2) Repeat steps (3) to (4) until no more cluster centroids can be chosen.
(3) Pick the non-sparse data point with the biggest number of points in its ε-neighborhood as the next cluster centroid.
(4) Remove both this point and its ε-neighborhood points from the initial dataset.
Then take the restdata points as the initial dataset.( 5 ) Repeat step( 6 )to( 8 ) untill all the clusters don' t change any more.( 6 )Calculate the distance between each rest data point and each cluster center , assigning these rest datapoints to the closest cluster according the minimum distance criterion.一1037-( 7 ) Merge the clusters according to the clustering merging rule as much as possible.( 8 ) Recalculate the new cluster center for each cluster.( 9 ) Assign the sparse data points separated in step( 1 ) according to the Improved-MLC , identify whetherit belongs to a new cluster or to an already exist cluster.For the input parameter K is critical to the clustering number and the rationality of clustering result ,andsetting the parameter K lies on domain expert' s experience greatly , the traditional K-means algorithm is appliedat a discount. Though the Improved-K-means algorithm in this paper imports two parameters ε and MinPts ,they have much less effect on the clustering result. The clustering number self study function can reduce the ex-cessive reliance on input parameter K during clustering analysis.In addition , the irrational initial center which is needed by K- means algorithm will lead the algorithm to lo-cal optimization easily. This improved K- means algorithm can get a better clustering result by a special way toassure the initial points be placed dispersedly enough. Though adding some steps , the loop number will fall downfor the more rational initial center and the time complexity of improved K -means is more or less equal to the tra-ditional one.3 Approach and Strategy AnalysisA great deal of product-performance-parameters variances are usually caused by manufacturing process pa-rameters variances which always changed slowly and gradually. These parameters variances can be divided intomany types such as material differences , the stability of the machine equipment , the variance of the environmentconditions , and the change of the operator. 
In the manufacturing process, the product performance parameters vary within a local scope and obey a certain distribution (a normal distribution) while only casual factors exist. However, when there are systemic factors such as deteriorating machine equipment, low-level operator skill, or a low passing rate of the material, the performance parameters may depart from the original distribution although they still stay within the normal scope. Conversely, this aberrancy means that the manufacturing process may have some hidden quality trouble.

In order to realize anomaly-trend detection, an integrated system is needed to unite the interrelated techniques organically so as to exert their whole function. The analysis system proposed in this article can be divided into three parts: the data standardization and attribute-weight analysis part, the clustering analysis part, and the anomaly detection and recognition part. The whole analysis-system flow chart is depicted in Fig. 1.

Fig. 1 Clustering-based anomaly detection and recognition system flow chart (stages: online data collection, fault-sample database, standardization, clustering analysis and explanation, analytical decision table, statistical layer, anomaly detection and alarm generation)

3.1 Data Standardization Process and Attribute Weight Analysis

After the data-sampling step, choose the operation data correlated with the particular problem, and preprocess these data: fill up the missing data and smooth the abnormal data. Suppose the collected sampling dataset is X = {x1, x2, …, xn}, where each sample xi is characterized by the variable group xi = {xi1, xi2, …, xij, …, xim}. In this formula, xij represents the jth characteristic parameter of the ith sample.

Because the importance of different characteristic parameters to the problem analysis differs, it is necessary to establish weights and give a weight value to each parameter.
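The preprocessing just described (filling missing values, then standardizing each characteristic parameter) might look as follows. Z-score standardization and mean-filling are common choices that we assume here for illustration, since the paper does not spell out its exact formulas:

```python
def standardize(samples):
    """Column-wise z-score standardization of an n-by-m sample list.
    Missing entries (None) are filled with the column mean first."""
    m = len(samples[0])
    stats = []
    for j in range(m):
        vals = [row[j] for row in samples if row[j] is not None]
        mean = sum(vals) / len(vals)
        std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
        stats.append((mean, std if std > 0 else 1.0))
    return [[((row[j] if row[j] is not None else stats[j][0]) - stats[j][0])
             / stats[j][1] for j in range(m)] for row in samples]
```

After this step every column has zero mean and (where non-constant) unit variance, so the attribute weights below act on comparable scales.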
The weighted Euclidean distance from xi to xj is:

dis(xi, xj) = ( ∑k=1..m wk (xik − xjk)² )^(1/2)

If Proximity(X → Ck) > λ, then we call sample X near to cluster Ck. The value setting of λ should refer to the losses caused by the corresponding quality problem: the greater the loss is, the smaller the λ value we should set. Usually λ is set to 0.75-1.

Definition (extensible object): A testing sample X satisfies the following conditions with respect to cluster Ck:
a) Proximity(X → Ck) > λ;
b) a ≥ (1 + μ)b, where a (Sul(X, Ck) = a) and b (Sul(X, Cj) = b) are the two biggest subjection degrees of sample point X and a ≥ b.
Then we call testing sample X an extensible object of cluster Ck.

In this statistical layer, we identify all the cluster-concerned extensible objects from the collected testing sample set, then draw the extensible-object distribution histogram and turn to the analysis and alarm-generation layer.

Level 3) Analysis and alarm-generation layer: When forecasting discrete factors of latent quality problems, such as online workers' low-level skill, online workers' weak quality awareness, or a low passing rate of the material, these discrete factors occur repeatedly though they do not contain a deteriorating trend in the time dimension. So we compute the statistical extensible-object number of each cluster, draw the distribution histogram, and generate the anomaly-trend alarm corresponding to the cluster with the biggest frequency.

Fig. 2 represents a distribution histogram of extensible objects. There are thirteen extensible objects in the testing sample set; the distribution frequencies of clusters C1, C2, C3 are 8, 5, 1.
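The statistical and alarm-generation layers can be sketched as follows. Since the paper's exact Proximity formula is illegible in the scan, proximity is approximated here by the Definition-5 subjection degree evaluated with the weighted distance of Sec. 3.1; the thresholds λ = 0.8 and μ = 0.5, the cluster layout, and the function names are illustrative assumptions.

```python
import math

def weighted_subjection(x, center, radius, w, sigma=0.3):
    """Definition 5 evaluated with the weighted Euclidean distance."""
    r = math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, center)))
    if r <= radius:
        return 1.0
    return math.exp(-((r - radius) ** 2) / (2 * sigma ** 2))

def extensible_alarm(samples, clusters, w, lam=0.8, mu=0.5):
    """Count the extensible objects of each cluster (conditions a and b)
    and name the cluster with the biggest frequency for the alarm."""
    counts = {name: 0 for name in clusters}
    for x in samples:
        degs = sorted(((weighted_subjection(x, c["center"], c["radius"], w), name)
                       for name, c in clusters.items()), reverse=True)
        (a, best), (b, _) = degs[0], degs[1]
        if a > lam and a >= (1 + mu) * b:   # conditions a) and b)
            counts[best] += 1
    return counts, max(counts, key=counts.get)
```

The returned counts correspond to the histogram of Fig. 2, and the second return value is the cluster whose anomaly-trend alarm should be raised.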
Of course, we should generate the anomaly-trend alarm of cluster C1 and take corresponding preventive-maintenance action.

Fig. 2 Extensible-objects distribution histogram

4 Experiment

Based on six months of gearbox performance-testing data from the gearbox assembly line of a gear plant, we extracted 120 pieces of data showing malfunction symptoms in the third gear. The tested characteristic parameters are: the maximum noise value in the accelerating process after shifting gears, the real gearing rate, the temperature, the odograph value, and the peak value of the shifting-gears power curve. These five attributes form the characteristic variable group. The weight and standard value of each attribute are displayed in Table 1.

Table 1 The weight and standard value of each attribute

Attribute              | Num | Weight | Standard value
Noise                  | 1   | 0.35   | < 90 db
Gearing rate           | 2   | 0.25   | 6.163
Power-curve peak value | 3   | 0.2    | 100-180 N
Temperature            | 4   | 0.1    | 40 °C-60 °C
Odograph value         | 5   | 0.1    | 85.21

The anomaly-analysis decision table based on the clustering result is shown in Table 2.

Table 2 Anomaly-analysis decision table

Cluster | Clustering center                  | Fault station                                                | Malfunction reasons                                                               | Count | Scale
C1      | O1 = (97, 6.157, 110, 71.3, 84.73) | Middle axle assembly station or gearbox coping assembly station | Disqualification of the gear wheel in the middle axle, or abnormality of the middle axle | 31    | 0.463
        |                                    |                                                              | Gear wheel assembled too tight                                                    | 24    | 0.358
        |                                    |                                                              | Burrs on the gear sleeve                                                          | 12    | 0.179
C2      | O2 = (83, 6.01, 127, 49, 69.36)    | Odograph assembly station                                    | Abnormal fault in the odograph (mainly from the active gear and the driven gear)  | 19    | 1.0
C3      | O3 = (…, 6.157, 87, 52, 85.02)     | Gearbox coping assembly station                              | Abnormal distortion of the shifting fork                                          | 15    | 0.441
        |                                    |                                                              | Inflexible switch rocker                                                          | 7     | 0.206
        |                                    |                                                              | Shift-gears fork-axle flexure                                                     | 12    | 0.353

Test the normality of the noise, the gearing rate, and the power-curve peak value online.
Once any of them deviates from the local normal distribution, take the characteristic variable groups of 15 consecutive products and identify the extensible objects (setting the proximity threshold λ to 0.8); then analyze the distribution of the extensible objects and generate the corresponding alarm. For example, if there are 9 extensible objects, 6 in C1, 1 in C2, and 2 in C3, we should generate the alarm of C1 and take action to strengthen the quality control of the middle axle assembly station, reminding the station workers to check the material strictly and to avoid assembling the gear wheel too tight. The levels of prediction accuracy are summarized in Table 3: 93% of C1, 89% of C2, and 82% of C3 anomaly trends were correctly predicted.

Table 3 Summary of prediction accuracy

Cluster | Correctly predicted | Wrongly predicted
C1      | 93%                 | 7%
C2      | 89%                 | 11%
C3      | 82%                 | 18%

5 Conclusions

This hidden-quality-trouble forecasting manner is becoming a key step in avoiding losses and enhancing the overall competitiveness of manufacturing plants. This paper applied business-intelligence techniques (i.e., data warehousing and data mining) to quality management and proposed a simple, robust, real-time system for the early prediction of faults. The data-mining algorithms produced easy-to-interpret rule sets, which were employed by the hierarchical decision-making model to predict faults. The anomaly-detection module captures temporal fault patterns, thus increasing the chances of predicting abnormal trends and issuing advance warnings. The developed system is effective, and its modular architecture allows incorporating alternative knowledge-generation modules such as other data-mining approaches, analytical models, or domain knowledge. The alarm system was successfully applied to the data from a famous gear plant; independent test datasets were used to validate the developed system.
