Implicit Parallel FEM Analysis of Shallow Water Equations


  • Journal: Tsinghua Science and Technology (the English edition of the Journal of Tsinghua University, Science and Technology)
  • Authors: JIANG Chunbo, LI Kai, LIU Ning, ZHANG Qinghai
  • Affiliations: Department of Hydraulic and Hydropower Engineering, Tsinghua University; Ministry of Water Resources
Paper overview

TSINGHUA SCIENCE AND TECHNOLOGY, ISSN 1007-0214, 14/21, pp. 364-371, Volume 10, Number 3, June 2005

Implicit Parallel FEM Analysis of Shallow Water Equations

JIANG Chunbo (江春波), LI Kai (李凯), LIU Ning (刘宁), ZHANG Qinghai (张庆海)
Department of Hydraulic and Hydropower Engineering, Tsinghua University, Beijing 100084, China; Ministry of Water Resources, Beijing 100053, China

Abstract: The velocity field in the Wu River at Chongqing was simulated using the shallow water equations implemented on clustered workstations. The parallel computing technique was used to increase the computing power. The shallow water equations were discretized to a linear system of equations, and a direct parallel generalized minimum residual algorithm (GMRES) was used to solve the linear system. Unlike other parallel GMRES methods, the direct GMRES method does not alter the sequential algorithm, but bases the parallelization on basic operations such as the matrix-vector product. The computed results agree well with observed results. The parallel computing technique significantly increases the solution speed for this large-scale problem.

Key words: shallow water equations; parallel computing; generalized minimum residual algorithm (GMRES)

Received: 2004-01-12; revised: 2004-07-16
* Supported by the National Natural Science Foundation of China (Nos. 50379022 and 59979013)
** To whom correspondence should be addressed. E-mail: jcb@mail.tsinghua.edu.cn; Tel: 86-10-62783704

Introduction

The Three-Gorges Project in China has created many engineering problems. One of them is that the pollution distribution in the Yangtze River will be totally changed, so the pollution distribution must be accurately predicted to properly control the discharges. The first step is to predict the velocity field at key points. However, large-scale analysis of the velocity field is a daunting task because, no matter how good the algorithm, the computational capacity of traditional computers limits the number of elements, which in turn limits the accuracy.

Parallel computing provides a feasible and efficient solution for very large-scale prediction. By breaking a large task into many small tasks and letting many processes complete them simultaneously, parallel computing not only allows the use of a much larger number of elements, but also drastically reduces the time required for the computation.

This paper uses the shallow water equations as the governing equations to model the river flow. For the parallel computation, the mesh is automatically partitioned using the geometric mesh partitioning method[1]. The governing equations are then discretized implicitly to form a large sparse linear system, which is solved using a direct parallel generalized minimum residual algorithm (GMRES). The program was tested by simulating the velocity field of the Wu River at Chongqing.

1 Governing Equations and Discretization

The river flow can be modeled using the shallow water equations, which are obtained by integrating the conservation of mass and momentum equations over the depth, assuming a uniform velocity distribution in the vertical direction:

$\frac{\partial \eta}{\partial t} + \frac{\partial q_i}{\partial x_i} = 0$    (1)

$\frac{\partial q_i}{\partial t} + \frac{\partial}{\partial x_j}\!\left(\frac{q_i q_j}{H}\right) = -gH\frac{\partial \eta}{\partial x_i} + \frac{1}{\rho}\left(\tau_i^{s} - \tau_i^{b}\right) + \frac{\partial}{\partial x_j}\!\left(\nu \frac{\partial q_i}{\partial x_j}\right), \quad i, j = 1, 2$    (2)

where $\eta$ is the water elevation, $H$ is the water depth, $q_i = H u_i$, $u_i$ is the mean horizontal velocity, $g$ is the gravitational acceleration, $\nu$ is the eddy viscosity, $\rho$ is the water density, $\tau_i^{s}$ is the surface shear stress, and $\tau_i^{b}$ is the bottom shear stress.

2 Explicit and Implicit Interleaving Scheme

The analysis is based on the finite element method using triangular elements.

A Taylor series expansion in time gives:

$\varphi^{n+1} = \varphi^{n} + \Delta t \left.\frac{\partial \varphi}{\partial t}\right|_{n} + \frac{\Delta t^{2}}{2}\left.\frac{\partial^{2} \varphi}{\partial t^{2}}\right|_{n} + \frac{\Delta t^{3}}{6}\left.\frac{\partial^{3} \varphi}{\partial t^{3}}\right|_{n} + O(\Delta t^{4})$    (3)

$\varphi^{n-1} = \varphi^{n} - \Delta t \left.\frac{\partial \varphi}{\partial t}\right|_{n} + \frac{\Delta t^{2}}{2}\left.\frac{\partial^{2} \varphi}{\partial t^{2}}\right|_{n} - \frac{\Delta t^{3}}{6}\left.\frac{\partial^{3} \varphi}{\partial t^{3}}\right|_{n} + O(\Delta t^{4})$    (4)

$\varphi^{n-2} = \varphi^{n} - 2\Delta t \left.\frac{\partial \varphi}{\partial t}\right|_{n} + 2\Delta t^{2}\left.\frac{\partial^{2} \varphi}{\partial t^{2}}\right|_{n} - \frac{4\Delta t^{3}}{3}\left.\frac{\partial^{3} \varphi}{\partial t^{3}}\right|_{n} + O(\Delta t^{4})$    (5)

Combining the three equations to eliminate the terms with $\Delta t^{2}$ and $\Delta t^{3}$ gives:

$\left.\frac{\partial \varphi}{\partial t}\right|_{n} = \frac{2\varphi^{n+1} + 3\varphi^{n} - 6\varphi^{n-1} + \varphi^{n-2}}{6\Delta t} + O(\Delta t^{3})$    (6)

$\left.\frac{\partial \varphi}{\partial t}\right|_{n+1} = \frac{11\varphi^{n+1} - 18\varphi^{n} + 9\varphi^{n-1} - 2\varphi^{n-2}}{6\Delta t} + O(\Delta t^{3})$    (7)
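As a quick check that is not part of the original paper, the coefficients in Eqs. (6) and (7) can be verified symbolically. The short Python sketch below, assuming sympy is available, expands the time levels about t_n as in Eqs. (3)-(5) and confirms that both difference formulas reproduce the first time derivative with a leading error of order Δt^3.

import sympy as sp

t, dt = sp.symbols('t dt', positive=True)
phi = sp.Function('phi')

def taylor(offset):
    # Taylor expansion of phi(t + offset*dt) about t_n, as in Eqs. (3)-(5)
    return phi(t) + sum((offset * dt)**k / sp.factorial(k) * sp.diff(phi(t), t, k)
                        for k in range(1, 6))

p_np1, p_n, p_nm1, p_nm2 = taylor(1), phi(t), taylor(-1), taylor(-2)

# Eq. (6): approximation of the time derivative at level n
eq6 = (2*p_np1 + 3*p_n - 6*p_nm1 + p_nm2) / (6*dt)
# Eq. (7): approximation of the time derivative at level n+1
eq7 = (11*p_np1 - 18*p_n + 9*p_nm1 - 2*p_nm2) / (6*dt)

err6 = sp.expand(eq6 - sp.diff(phi(t), t))      # exact derivative at t_n
err7 = sp.expand(eq7 - sp.diff(taylor(1), t))   # exact derivative at t_{n+1}

for err in (err6, err7):
    # no error contribution below Delta t^3, so both formulas are O(dt^3)
    assert all(err.coeff(dt, k) == 0 for k in range(3))
print("Eqs. (6) and (7) are third-order accurate in time")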
The time discretization then includes a prediction step and a correction step.

1) Prediction step
In the prediction step, the time differential term is discretized as

$\left.\frac{\partial \varphi}{\partial t}\right|_{n} \approx \frac{2\varphi^{n+1} + 3\varphi^{n} - 6\varphi^{n-1} + \varphi^{n-2}}{6\Delta t}$    (8)

where $\varphi$ stands for $q_i$ ($i = 1, 2$) or $H$. The explicit schemes in time are then

$\left.\frac{\partial \eta}{\partial t}\right|_{n} + \frac{\partial q_i^{n}}{\partial x_i} = 0$    (9)

$\left.\frac{\partial q_i}{\partial t}\right|_{n} + \frac{\partial}{\partial x_j}\!\left(\frac{q_i^{n} q_j^{n}}{H^{n}}\right) = -gH^{n}\frac{\partial \eta^{n}}{\partial x_i} + S(q^{n}, H^{n})$    (10)

2) Correction step
In the correction step, the time differential term is discretized as

$\left.\frac{\partial \varphi}{\partial t}\right|_{n+1} \approx \frac{11\varphi^{n+1} - 18\varphi^{n} + 9\varphi^{n-1} - 2\varphi^{n-2}}{6\Delta t}$    (11)

The implicit schemes in time are then

$\left.\frac{\partial \eta}{\partial t}\right|_{n+1} + \frac{\partial q_i^{n+1}}{\partial x_i} = 0$    (12)

$\left.\frac{\partial q_i}{\partial t}\right|_{n+1} + \frac{\partial}{\partial x_j}\!\left(\frac{q_i^{n+1} q_j^{n}}{H^{n}}\right) = -gH^{n}\frac{\partial \eta^{n+1}}{\partial x_i} + S(q^{n}, H^{n})$    (13)

where the source term $S(q, H)$ is defined as

$S(q, H) = \frac{1}{\rho}\left(\tau_i^{s} - \tau_i^{b}\right) + \frac{\partial}{\partial x_j}\!\left(\nu \frac{\partial q_i}{\partial x_j}\right)$    (14)

The finite element method with triangular elements is then used for the spatial discretization. For the prediction step, the discretized formulations of Eqs. (9) and (10) are explicit both in time and space. For the correction stage, however, the discretized formulations of Eqs. (12) and (13) form the large sparse linear system

$A U^{n+1} = F$    (15)

where $U^{n+1} = [q_1^{n+1}, q_2^{n+1}, \eta^{n+1}]^{\mathrm{T}}$.
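To make the structure of the interleaving scheme concrete, the sketch below (an illustration, not the authors' code) applies the prediction and correction formulas, Eqs. (8) and (11), to a small stand-in problem dU/dt = LU, where the sparse matrix L plays the role of the spatially discretized shallow water operator. The correction step yields a linear system of the form of Eq. (15), solved here with scipy's sequential GMRES in place of the parallel GMRES described in Section 3.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Stand-in for the spatially discretized operator (a 1-D diffusion stencil);
# in the paper this role is played by the FEM discretization of Eqs. (9)-(13).
n, dt = 50, 1.0e-3
dx = 1.0 / n
L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / dx**2
I = sp.identity(n)

U = np.exp(-((np.linspace(0.0, 1.0, n) - 0.5)**2) / 0.01)   # initial field
U_nm1, U_nm2 = U.copy(), U.copy()

for _ in range(100):
    # Prediction step: Eq. (8) with the spatial terms evaluated at level n,
    # solved explicitly for the provisional value.  (For this linear stand-in
    # the correction below does not need U_pred; it is shown only to
    # illustrate the formula.)
    U_pred = 0.5 * (6.0 * dt * (L @ U) - 3.0 * U + 6.0 * U_nm1 - U_nm2)

    # Correction step: Eq. (11) applied to dU/dt = L U gives
    # (11 I - 6 dt L) U^{n+1} = 18 U^n - 9 U^{n-1} + 2 U^{n-2},
    # a sparse linear system A U^{n+1} = F as in Eq. (15).
    A = 11.0 * I - 6.0 * dt * L
    F = 18.0 * U - 9.0 * U_nm1 + 2.0 * U_nm2
    U_new, info = spla.gmres(A.tocsr(), F, x0=U)

    U_nm2, U_nm1, U = U_nm1, U, U_new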
3 Parallel Implementation

3.1 Parallel environment

The parallel environment consists of three parts: the hardware, the operating system, and the software.

A cluster of workstations was used because of its excellent scalability, high performance, and low price. The workstations were PC architectures with 550-MHz processors, 128 MB of SDRAM, and 100-Mbps network cards; a switch connected them into a local fast Ethernet.

Redhat Linux 7.0 was selected as the operating system because of its low memory requirements, superior network performance and, most important, far better support for parallel programming environments (such as PVM and MPI) than Windows operating systems.

The software environment also includes a communication library and a parallel programming library. The message passing interface (MPI) was used for the communication library because MPI has become an international standard for parallel computation and many libraries already support it. The parallel library used the well-known BLAS and LAPACK libraries, which eliminated the need to write many basic routines. Figure 1 shows the whole parallel environment.

[Fig. 1 Parallel environment: application codes built on BLAS, MPI, and LAPACK over Redhat Linux, with a master node and slave nodes connected by a switch]

3.2 Auto mesh generation and domain decomposition

Unstructured triangular elements are used for the finite element discretization. A Delaunay triangulation algorithm[2] is used to automatically generate the mesh. In this method, the computing domain is defined by many boundary points which form a polygon; each pair of neighboring points forms a boundary edge. The algorithm then proceeds as follows.

Algorithm 1 Delaunay triangulation
1) Place node points on all edges;
2) Enclose the geometry in a bounding box;
3) Triangulate the edges;
4) Check that the triangulation respects the boundaries;
5) Insert node points into the centers of the circumscribed circles of the large triangles;
6) Repeat from step 4) if Hmax (the maximal mesh dimension allowed) is not yet achieved;
7) Remove the bounding box.

After generation, the internal points of the triangular mesh are slightly adjusted so that the triangles tend towards equilateral triangles, which are best for the finite element computations.

The partitioning of the unstructured grid among the processors should seek to minimize the execution time. The execution or wall clock time is the maximum computation time among all the processes plus the communication time between processors. In the computation, the workloads assigned to the processors are kept roughly equal so as to minimize the waiting time. The communication time should be reduced as much as possible, which means minimizing the number of ghost points (extra points used to link the different partitioned subdomains).

In practice, the generated mesh is partitioned according to the number of processors using a geometric mesh partition method[1]. This method uses graph theory to devise and validate the partitioning and, unlike the method given by Farhat[3], has been proved mathematically to minimize the number of ghost points and to roughly equalize the number of points in each subdomain.

The domain decomposition is based on graph partitioning as shown in Fig. 2. Figure 2a shows the finite element mesh and Fig. 2b shows the abstracted node points of the triangular elements. The domain decomposition algorithm is given below; a simplified code sketch of the same idea follows Fig. 2.

Algorithm 2 Domain decomposition
1) Project up. Project the input points from the plane onto the unit sphere centered at the origin, each point moving along the line through it and the "north pole" (0, 0, 1), as shown in Fig. 2c.
2) Find the center point. Compute a center point of the projected points on the sphere.
3) Rotate and dilate. Rotate the projected points around the origin so that the center point moves to a point (0, 0, r) on the Z-axis; then dilate the surface of the sphere so that the center point moves to the origin, as shown in Fig. 2d.
4) Find the great circle. Choose a random great circle on the unit sphere.
5) Unmap and project down. Transform the great circle into a circle in the plane by undoing the dilation, rotation, and stereographic projection, as shown in Fig. 2e.
6) Form the separator from the mapped circle. Figure 2f shows an edge separator.

[Fig. 2 Domain decomposition algorithm: (a) finite element mesh; (b) mesh points in the plane; (c) projected mesh points; (d) conformally mapped points; (e) reverse-mapped mesh points; (f) partition of the original mesh (42 cut edges)]
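The following is a heavily simplified sketch, not the authors' implementation, of the projection-and-split idea behind Algorithm 2: the mesh points are lifted to the unit sphere by stereographic projection, recentred, and divided by a random great circle. The center-point computation and conformal dilation of the full method[1] are approximated here by a plain centroid shift, and only a two-way split is produced.

import numpy as np

def stereographic_up(xy):
    # Lift 2-D points onto the unit sphere through the north pole (0, 0, 1)
    x, y = xy[:, 0], xy[:, 1]
    s = x**2 + y**2
    return np.column_stack((2*x, 2*y, s - 1.0)) / (s + 1.0)[:, None]

def geometric_bisect(xy, trials=50, seed=0):
    # Two-way split of the mesh points in the spirit of Algorithm 2.
    # Simplification: the center point is approximated by the centroid of the
    # projected points, and the most balanced of several random great circles
    # is kept instead of the provably good circle of the full method.
    rng = np.random.default_rng(seed)
    pts = stereographic_up(xy)
    pts = pts - pts.mean(axis=0)                 # crude "rotate and dilate" step
    pts /= np.linalg.norm(pts, axis=1)[:, None]

    best, best_imbalance = None, np.inf
    for _ in range(trials):
        normal = rng.normal(size=3)              # a random great circle is the
        normal /= np.linalg.norm(normal)         # intersection with a plane through 0
        side = pts @ normal > 0.0
        imbalance = abs(side.sum() - (~side).sum())
        if imbalance < best_imbalance:
            best, best_imbalance = side, imbalance
    return best                                  # boolean mask: True = subdomain 1

# Example: split 1000 random points into two subdomains of similar size
xy = np.random.default_rng(1).random((1000, 2))
part = geometric_bisect(xy)
print(part.sum(), (~part).sum())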
3.3 Renumbering

The linear systems are usually preconditioned, either by the left-preconditioning method or by the right-preconditioning method, to reduce the matrix bandwidth. A large bandwidth greatly slows the convergence and increases the communication overhead, which reduces the parallel efficiency.

However, all the preconditioning methods require many floating point operations, so renumbering techniques were used instead of preconditioning to reduce the matrix bandwidth. Renumbering is quite economical because the time it requires is much less than that needed for preconditioning. Figure 3 shows an example mesh, the corresponding original matrix structure, and the renumbered matrix structure. The result is quite good, with a much smaller bandwidth that significantly reduces the communications and the calculations.

[Fig. 3 Renumbering: (a) example mesh; (b) original matrix structure (nz = 906); (c) renumbered matrix structure (nz = 906)]
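The paper does not name the renumbering scheme it uses; a common bandwidth-reducing reordering is reverse Cuthill-McKee, and the sketch below (an illustration only) uses scipy to show how such a renumbering shrinks the bandwidth of a badly numbered sparse matrix, in the spirit of Fig. 3.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    # Maximum distance of a nonzero entry from the diagonal
    coo = A.tocoo()
    return int(np.max(np.abs(coo.row - coo.col)))

# Stand-in for an assembled FEM matrix: a 1-D chain of nodes (tridiagonal
# pattern) whose nodes were numbered in an arbitrary order, as can happen
# with automatically generated meshes.
n = 200
A_band = sp.diags([1.0, 2.0, 1.0], [-1, 0, 1], shape=(n, n), format='csr')
perm0 = np.random.default_rng(0).permutation(n)
A = A_band[perm0, :][:, perm0]                  # badly numbered matrix (like Fig. 3b)

rcm = reverse_cuthill_mckee(A.tocsr(), symmetric_mode=True)
A_renumbered = A[rcm, :][:, rcm]                # renumbered matrix (like Fig. 3c)

print("bandwidth before renumbering:", bandwidth(A))            # large
print("bandwidth after renumbering :", bandwidth(A_renumbered)) # small (1 or 2)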
3.4 Parallel GMRES

Consider the linear system $AU = F$ with $U, F \in R^{n}$ and a nonsingular nonsymmetric matrix $A \in R^{n \times n}$. The Krylov subspace $K(m; A; r)$ is defined by $K(m; A; r) = \mathrm{span}(r, Ar, \ldots, A^{m-1}r)$. GMRES[4] uses the Arnoldi process to construct an orthonormal basis $V_m = [v_1, v_2, \ldots, v_m]$ of $K(m; A; r)$. The full GMRES allows the Krylov subspace dimension to increase up to $n$ and always terminates in at most $n$ iterations. The restarted GMRES restricts the Krylov subspace dimension to a fixed value $m$ and restarts the Arnoldi process using the last iterate as the new initial guess, but it may stall. The algorithm for the restarted GMRES is as follows.

Algorithm 3 Sequential GMRES
ε is the tolerance for the residual norm;
convergence = false;
choose U_0;
UNTIL convergence DO
    r = F - A U_0;
    β = ||r||;
    * Arnoldi process: construct a basis V_m of the Krylov subspace K
    v_1 = r / β;
    FOR j = 1, m DO
        p = A v_j;
        FOR i = 1, j DO
            h_{i,j} = v_i^T p;
            p = p - h_{i,j} v_i;
        ENDFOR;
        h_{j+1,j} = ||p||;
        v_{j+1} = p / h_{j+1,j};
    ENDFOR;
    * Minimize ||F - AU|| for U in U_0 + K_m
    * Solve a least-squares problem of size m
    Compute y_m, the solution of min_{y in R^m} ||β e_1 - H_m y||;
    U_m = U_0 + V_m y_m;
    IF ||F - A U_m|| < ε, convergence = true;
    U_0 = U_m;
ENDDO

The matrix $H_m$ is an upper Hessenberg matrix of order $(m+1) \times m$ that satisfies the fundamental relation

$A V_m = V_{m+1} H_m$

The GMRES algorithm computes $U_m = U_0 + V_m y_m$, where $y_m$ solves the least squares problem $\min_y \|\beta e_1 - H_m y\|$. Usually, a QR factorization of $H_m$ using Givens rotations is used to solve this least squares problem.

Several authors have studied the parallelization of the GMRES algorithm. Some alter the sequential Arnoldi process to achieve parallelization[5-7], some define a new basis of the Krylov subspace[8], while others give variants such as α-GMRES[9] and GMRES with a Chebyshev basis[10].

The present parallel implementation does not change the sequential GMRES algorithm, but breaks the algorithm into basic operations which are then done in parallel. In the GMRES algorithm, the most frequently used basic operation is the matrix-vector multiplication, which should therefore be optimized (its communication time reduced as much as possible) so that the overall parallel performance is drastically enhanced.

For large sparse matrices distributed to the processors by rows, only the few components of x that match the nonzeros in row i of the matrix are required to compute the corresponding component (Ax)_i.

The matrix block with local row and column indices is denoted by $A_{loc,loc}$ and the matrix block with local rows but external columns is denoted by $A_{loc,ext}$. The matrix-vector product is rewritten as

$y_{loc} = A_{loc,loc}\, x_{loc} + A_{loc,ext}\, x_{ext}$

Each processor keeps a list, send(p), of the components that will be sent to processor p and a list, recv(q), of the components that will be received from processor q. The matrix-vector multiplication algorithm[11] is then as follows:

Algorithm 4 Distributed sparse matrix-vector product
1) Send x_loc[send(p)] to each processor p;
2) Compute y_loc = A_loc,loc x_loc;
3) Receive x_ext[recv(q)] from each processor q and build x_ext;
4) Compute y_loc = y_loc + A_loc,ext x_ext.

Because this algorithm overlaps communications with computations and is scalable, it is an excellent parallel algorithm.

[Fig. 4 Matrix-vector product algorithm]
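Below is a minimal mpi4py sketch of Algorithm 4, written under simplifying assumptions that are not in the paper: the local matrix blocks are stored as dense numpy arrays, every process exchanges its whole local slice of x with every other process (so the send and recv lists of Algorithm 4 become trivial), and the script name in the run command is only an example. The non-blocking sends let step 2 overlap with the communication, which is the property the paper emphasizes.

# Run with, e.g.:  mpiexec -n 4 python matvec_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_loc = 100                                     # rows owned by this process
rng = np.random.default_rng(rank)

# Local block of rows of A, split into the part acting on local columns
# (A_loc_loc) and the part acting on external columns (A_loc_ext).
A_rows = rng.random((n_loc, n_loc * size))
local_cols = np.arange(rank * n_loc, (rank + 1) * n_loc)
A_loc_loc = A_rows[:, local_cols]
A_loc_ext = np.delete(A_rows, local_cols, axis=1)

x_loc = rng.random(n_loc)

# 1) Send x_loc to the other processes (non-blocking, overlaps with step 2)
requests = [comm.Isend(x_loc, dest=p) for p in range(size) if p != rank]

# 2) Local part of the product while the messages are in flight
y_loc = A_loc_loc @ x_loc

# 3) Receive the external components and build x_ext in global column order
x_ext_parts = []
for q in range(size):
    if q == rank:
        continue
    buf = np.empty(n_loc)
    comm.Recv(buf, source=q)
    x_ext_parts.append(buf)

# 4) Add the contribution of the external columns
if x_ext_parts:
    y_loc += A_loc_ext @ np.concatenate(x_ext_parts)

MPI.Request.Waitall(requests)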
4 Numerical Experiments

4.1 Flow around a circular cylinder

The flow around a circular cylinder has no fixed separation point; therefore, the flow pattern is related to the Reynolds number. The parallel implementation was used to calculate the velocity field at different Reynolds numbers below 300.

4.1.1 Computational domain and boundary conditions

As shown in Fig. 5a, D represents the diameter of the circular cylinder. The computational domain is about 25D long in the flow direction and 10D wide in the direction perpendicular to the main flow, with an entrance length of 5D to the center of the circular cylinder, which ensures that the flow properties in the wake region are not affected by the outside boundary. The boundary conditions are a parabolic velocity distribution at the inlet (left), constant water depth at the outlet (right), zero shear stress on the upper and lower surfaces, and the no-slip velocity condition on the cylinder face. A very fine grid was used near the cylinder face to accurately locate the separation point. The computational domain was then partitioned using the domain decomposition algorithm. Figure 5b shows the 8-way partition.

[Fig. 5 Flow around a circular cylinder: (a) computational domain; (b) 8-way partition]

4.1.2 Results

At low Reynolds numbers, the flow is laminar. The wake region lengths at different Reynolds numbers are compared in Fig. 6 with the empirical formula L_w/D = 0.05Re (5 < Re < 40) given by Zdravkovich[12].

[Fig. 6 Length of the wake region vs. Reynolds number]

With increasing Reynolds number, the Karman vortex street behind the cylinder becomes asymmetric. The vortex shedding period can be described by the Strouhal number (Sr). The results in Fig. 7 show that one shedding period lasts from 29.8 s to 38.8 s.

[Fig. 7 Streamlines for flow around a circular cylinder (Re = 200) at t = 29.8 s, 32.8 s, 35.8 s, and 38.8 s]

The Strouhal number is then

$Sr = \frac{fD}{U} = \frac{D}{(38.8 - 29.8)U} = 0.221$    (16)

The Strouhal numbers at different Reynolds numbers are compared in Fig. 8 with those given by Williamson[13]. The error is relatively large at Reynolds numbers greater than 200. In 1996, Williamson pointed out that under such conditions the flow shows some three-dimensional characteristics, so the error of a two-dimensional simulation increases with the Reynolds number.

[Fig. 8 Strouhal number vs. Reynolds number]

4.2 Flow in the Yangtze River at Chongqing

4.2.1 Computational domain and boundary conditions

The computational domain includes the confluence of the Jialing River at Chongqing. The region is about 29 km long, with a water level of 175 m after the Three-Gorges Reservoir is built. The flux on the inlet side is 3460 m³/s and the flux at the confluence is 368 m³/s. The velocity distribution at these two sections is assumed to be uniform. No-slip velocity boundary conditions are assumed along the bottom.

4.2.2 Results

The simulated velocity field is shown in Fig. 9.

[Fig. 9 Simulated velocity field in the Jialing River at Chongqing]

4.3 Evaluation of parallel performance

Parallel performance is often evaluated by two factors:

$S = \frac{T_1}{T_N}$ and $E = \frac{S}{N}$

in which N is the number of processors, T_1 is the computation time on one processor, T_N is the total computation time on N processors, S is the parallel speedup, and E is the efficiency.

Three finite element meshes were used for each of the two numerical examples above. Tables 1 and 2 show the corresponding parallel performance parameters.

Table 1 Parallel efficiencies for the flow around a circular cylinder

Node number   T_1/s   N   T_N/s   S       E
2918          4498    2   2713    1.658   0.829
2918          4498    4   1538    2.924   0.731
2918          4498    8   974     4.616   0.577
11516         18115   2   10120   1.790   0.895
11516         18115   4   5404    3.352   0.838
11516         18115   8   2806    6.456   0.807
45752         93354   2   51237   1.822   0.911
45752         93354   4   26581   3.512   0.878
45752         93354   8   13942   6.696   0.837

Table 2 Parallel efficiencies for the flow along the Yangtze River at Chongqing

Node number   T_1/s    N   T_N/s    S       E
3784          11578    2   6631     1.746   0.873
3784          11578    4   3655     3.168   0.792
3784          11578    8   2047     5.656   0.707
13696         41142    2   22531    1.826   0.913
13696         41142    4   11932    3.448   0.862
13696         41142    8   6137     6.704   0.838
66738         212384   2   111429   1.906   0.953
66738         212384   4   57650    3.684   0.921
66738         212384   8   29563    7.184   0.898
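As a small worked example of the definitions of S and E (this snippet is not from the paper; the timings are copied from the first mesh of Table 1):

# Speedup S = T1/TN and efficiency E = S/N, using the 2918-node mesh of Table 1
T1 = 4498.0                                   # single-processor time in seconds
timings = {2: 2713.0, 4: 1538.0, 8: 974.0}    # N -> T_N in seconds

for N, TN in timings.items():
    S = T1 / TN
    E = S / N
    print(f"N = {N}: S = {S:.3f}, E = {E:.3f}")
# The printed values match Table 1 to within rounding of the reported times.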
5 Conclusions

In this paper, the shallow water equations were solved in parallel to simulate the velocity field of the Wu River at Chongqing. The large linear system was solved using a direct parallel GMRES method, which has an advantage over other parallel GMRES methods because it does not alter the sequential GMRES algorithm, so it is easily programmed and used. The computed results agree well with the observed results. The good speedup and efficiency of the parallel computation show that parallel computing is a good technique for solving large-scale problems.

References

[1] Miller G L, Teng S H. Automatic mesh partitioning. In: George A, Gilbert J, Liu J, eds. Sparse Matrix Computations: Graph Theory Issues and Algorithms, IMA Volumes in Mathematics and Its Applications. New York: Springer-Verlag, 1993.
[2] George P L. Automatic Mesh Generation: Application to Finite Element Methods. Washington, USA: Wiley, 1991.
[3] Farhat C. A simple and efficient automatic FEM domain decomposer. Computers and Structures, 1988, 28(5): 579-602.
[4] Saad Y, Schultz M H. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 1986, (7): 856-869.
[5] Di Brozolo G R, Robert Y. Parallel conjugate gradient-like algorithms for solving sparse nonsymmetric linear systems on a vector multiprocessor. Parallel Comput., 1989, (11): 223-239.
[6] Da Cunha R D, Hopkins T. A parallel implementation of the restarted GMRES iterative algorithm for nonsymmetric systems of linear equations. Adv. Comp. Math., 1994, (2): 261-277.
[7] Kim S K, Chronopoulos A. An efficient Arnoldi method implemented on parallel computers. In: International Conference on Parallel Processing. Pennsylvania University, 1991, 3: 167-196.
[8] De Sturler E. A parallel variant of GMRES(m). In: Miller J J H, Vichnevetsky R, eds. Proc. of the 13th IMACS World Conference on Computation and Applied Mathematics. Dublin: IMACS, Criterion Press, 1991: 682-683.
[9] Xu X, Richards B. Alpha-GMRES: A new parallelizable iterative solver for large sparse nonsymmetrical linear systems arising from CFD. Int. J. Numer. Meth. Fluids, 1992, 15(5): 613-623.
[10] Joubert W D, Carey G F. Parallelizable restarted iterative methods for nonsymmetric linear systems. I: Theory; II: Parallel implementation. Int. J. Comput. Math., 1992, 11: 243-290.
[11] Erhel J. A parallel GMRES version for general sparse matrices. Electronic Transactions on Numerical Analysis, 1995, (3): 160-176.
[12] Zdravkovich M M. Flow Around Circular Cylinders: Fundamentals. New York: Oxford University Press, 1997.
[13] Gropp W, Lusk E. User's Guide for MPICH, a Portable Implementation of MPI. Report ANL-96/6.
