
An Empirical Study on the Impact of Automation on the Requirements Analysis Process

  • Journal: Journal of Computer Science and Technology
  • Authors: Giuseppe Lami, Robert W. Ferguson
  • Affiliations: Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"; Software Engineering Institute
Paper Overview

Lami G, Ferguson RW. An empirical study on the impact of automation on the requirements analysis process. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 22(3): 338~347, May 2007.

An Empirical Study on the Impact of Automation on the Requirements Analysis Process

Giuseppe Lami 1 and Robert W. Ferguson 2
1 Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo", I-56124 Pisa, Italy
2 Software Engineering Institute, Carnegie Mellon University, 15213 Pittsburgh, PA, U.S.A.
E-mail: giuseppe.lami@isti.cnr.it; rwf@sei.cmu.edu
Received March 15, 2006; revised December 5, 2006.

Abstract  Requirements analysis is an important phase in a software project. The analysis is often performed in an informal way by specialists who review documents looking for ambiguities, technical inconsistencies and incomplete parts. Automation is still far from being applied in requirements analyses, above all since natural languages are informal and thus difficult to treat automatically. There are only a few tools that can analyse texts. One of them, called QuARS, was developed by the Istituto di Scienza e Tecnologie dell'Informazione and can analyse texts in terms of ambiguity. This paper describes how QuARS was used in a formal empirical experiment to assess the impact, in terms of effectiveness and efficacy, of automation in the requirements review process of a software company.

Keywords  requirements/specifications analysis, requirements/specifications tools, process metrics

1 Introduction

Natural language (NL) is the most commonly used means of representation for expressing requirements for computer-based systems in industry. Although natural language has the advantage of being universal and flexible, it is inherently ambiguous. Words can have different meanings. The Oxford English Dictionary states that the 500 most used words in English have on average 23 meanings[4], and the same is probably true for other languages. Phrases and sentences may also have different interpretations because different sequences of syntax rules of the grammar may be used to generate the same sentence. In some contexts ambiguity in NL is not necessarily a weakness, but in the requirements specification it definitely is.

In [2] two types of ambiguity are identified: linguistic and RE-specific ambiguity. Linguistic ambiguity entails multiple interpretation of words and phrases and is context independent, whereas RE-specific ambiguity is context dependent and can only be observed by taking into account the context of particular requirements or the whole set of requirements in the specification.

There are several ways to reduce the negative impact of ambiguity[3]. In an NL requirement specification document, one way is to perform joint or external reviews or inspections[4]. Some organisations use professional requirements analysts to identify ambiguities (both linguistic and RE-specific). The costs for reviews are high both in terms of money and time. Moreover, considering the volatility of requirements during a project life, reviews may hold up the entire project. Knowledge of the specific application domain by the requirements analysts is a key factor in the success of such a review.
There are several techniques for addressing the ambiguity of NL in requirements specification. Inspection-based techniques can support the review process[4], the most common being checklists. A checklist can be used effectively for checking individual requirements, but it is less useful for finding discourse ambiguities if these occur among requirements that are scattered over several pages. Scenario-based review techniques[5] are used for this second type of ambiguity. The overall idea is to provide an inspector with an operational scenario. To do that, the first step is to create an abstraction of a product, and then to answer questions based on analyzing the abstraction with a particular emphasis or role for the inspector. For example, the inspector might create test cases for a requirements document and then answer the question, "Do you have all the information needed to develop a test case?" If the answer is "no", then a defect may have been detected[2].

There are also some tool-based techniques for performing requirements analysis. In [6] there is an extensive survey of techniques and tools for textual requirements analysis. The techniques are classified into three categories: restrictive (based on rules and constraints that limit the level of freedom in writing requirements in NL); inductive (based on the indication of safe writing styles for requirements without providing the means to apply them); and analytic (based on linguistic techniques to identify and remove defects). Although there are several techniques and tools for analysing natural language requirements, there is little evidence in the literature that they have been empirically validated or assessed for effectiveness.

Empirical studies are important to establish the relative merits of this kind of technology. One of the reasons for the scarcity of empirical assessment in requirements analysis is the difficulty in aggregating empirical results[7]. In fact, these tools/techniques are usually developed in a particular environment, and the local assumptions may preclude sound assessment or benchmarking. Moreover, case studies with industrial partners are important for showing that they are feasible, as well as for confirming the value of case studies themselves.

In this paper we describe how an automatic linguistic analysis tool, QuARS, has been used in a controlled and rigorous empirical study. QuARS lexically and syntactically parses the textual requirements in order to identify any potential ambiguity. We highlight the advantages of automatic analysis of NL requirements with respect to a manual review process by human experts.

This paper is structured as follows. Section 2 gives a general overview of QuARS. Section 3 outlines the criteria for setting up a valid empirical study that can provide general conclusions, and in Section 4 we describe the design of the empirical study. Section 5 contains the data gathered in the study, which are then discussed and interpreted in Section 6, and Section 7 discusses the validity of the results. Finally, some conclusions are drawn in Section 8.
2 Requirements Analysis Methodology and Tool: Background

This section outlines the methodology and the related automatic tool used in the empirical study presented in this paper to deal with linguistic ambiguity in NL requirements.

2.1 Quality Model for Requirements Analysis

The main quality properties that can be addressed and evaluated by means of NL-understanding techniques can be grouped into three categories:
• Expressiveness, i.e., the incorrect understanding of the meaning of the requirements, specifically ambiguities and poor readability.
• Consistency, i.e., the presence of semantic contradictions in the NL requirements documents.
• Completeness, i.e., the lack of necessary information.

Nevertheless, NL-understanding techniques are effective above all in addressing issues related to Expressiveness, because the lexical and syntactical levels provide enough means to detect a large number of defects.

A Quality Model is the formalisation of the definition of the term "quality" to associate with a type of work product. The typical objectives of a quality model are to define, analyse and document a product:
• Quality characteristics define and document the relevant quality factors (also known as quality attributes or "ilities") which are important attributes of work products (e.g., applications, components or documents) or processes that characterise part of their overall quality (e.g., extensibility, operational availability, performance, re-usability). Quality sub-characteristics are important components of quality characteristics.
• Quality indicators are specific descriptions of something that provide evidence either for or against the existence of a specific quality characteristic or sub-characteristic.
• Quality metrics provide numerical values by estimating the quality of a work product or process by measuring the degree to which it possesses a specific quality characteristic.

The Quality Model defined in [9] for the Expressiveness property of NL software requirements provides a way to perform three types of evaluation: firstly, quantitative, which allows the collection of metrics; secondly, corrective, which could be helpful in the detection and correction of the defects; and thirdly, repeatable, which provides the same output against the same input in every domain.

Our definition of the Quality Model was driven by some results in NL understanding, by experience in formalising software requirements and also by an in-depth analysis of real requirements documents provided by industrial partners. Moreover, we took advantage of experience in the field of requirements engineering and software process assessment using the SPICE (ISO/IEC 15504) model[10]. Although not exhaustive, the model is sufficiently specific to include a significant part of the lexical and syntax-related issues of requirements documents.

The quality model for expressiveness consists of three qualities which are evaluated by means of indicators. Indicators are linguistic components of the requirements directly detectable and measurable on the requirements document. The expressiveness characteristics are:
• unambiguity: each requirement has a unique interpretation.
• completion: each requirement can uniquely identify its object or subject.
• understandability: each requirement can be fully understood when used for developing software, and the requirement specification document can be fully understood when read by the user.

Indicators, in this case, are linguistic or structural aspects of the requirements specification documents that provide information on the defects related to a particular property of the requirements themselves.
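To make the relationship between characteristics, indicators and the dictionaries that drive their detection easier to picture, the following is a minimal sketch of one possible representation. The class names, field names and example dictionary entries are illustrative assumptions for this article and are not taken from QuARS itself.

    # Minimal sketch (assumed names, not the actual QuARS data model): an
    # expressiveness characteristic groups indicators; a lexical indicator
    # carries the dictionary of defect-revealing terms used to detect it.
    from dataclasses import dataclass, field
    from typing import List, Set

    @dataclass
    class Indicator:
        name: str                       # e.g. "vagueness"
        analysis: str                   # "lexical" or "syntactical"
        dictionary: Set[str] = field(default_factory=set)  # used only if lexical

    @dataclass
    class Characteristic:
        name: str                       # e.g. "unambiguity"
        indicators: List[Indicator] = field(default_factory=list)

    # Illustrative content only; real dictionaries are tailored to the domain.
    vagueness = Indicator("vagueness", "lexical", {"adequate", "easy", "fast"})
    weakness = Indicator("weakness", "syntactical")   # needs verb identification
    unambiguity = Characteristic("unambiguity", [vagueness, weakness])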
The sentences recognised as defective according to the quality model are not defective sentences according to the rules of general English grammar, but they are incorrect in terms of the expressiveness defined above.

What characterises our approach to the analysis of NL requirements is the application of NL-understanding techniques addressing both lexical and syntactical aspects, and performing a quantitative and corrective evaluation of the requirements. In fact, our tool highlights the occurrences of the quality model indicators throughout the requirements document, thus enabling the document to be subsequently corrected manually.

2.2 QuARS Tool

QuARS was developed at the Istituto di Scienza e Tecnologie dell'Informazione of the Consiglio Nazionale delle Ricerche (ISTI-CNR), located in Pisa, Italy, to make automatic NL requirements analysis systematic, on the basis of the quality model described in Subsection 2.1[11]. The aim was to develop a modular, extensible tool with a user-friendly graphical interface.

QuARS performs the Expressiveness analysis by means of a lexical and syntactic analysis of the input file in order to identify those sentences containing defects according to the quality model. When the Expressiveness analysis is performed, the list of defective sentences is displayed by QuARS and a log file is created. The defective sentences can be tracked in the input requirements document and analysed, and, if they do not contain "false positives" (i.e., defects pointed out by the tool but not considered real defects by the analyst), they can be corrected by editing the text directly. Metrics measuring the defect rate and the readability of the requirements document being analysed are calculated and stored. The available metrics are:
• the Coleman-Liau Formula readability metric[12];
• the defect rate (i.e., the number of defective sentences / the total number of sentences).
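As a concrete illustration of these two document-level metrics, the sketch below computes the defect rate from the counts the tool reports and evaluates the published Coleman-Liau formula (0.0588*L - 0.296*S - 15.8, with L and S the letters and sentences per 100 words). The function names and the naive word/sentence splitting are simplifying assumptions, not the tool's actual implementation.

    # Illustrative sketch, not QuARS code: the two metrics listed above.

    def defect_rate(defective_sentences: int, total_sentences: int) -> float:
        # Defect rate = number of defective sentences / total number of sentences.
        return defective_sentences / total_sentences

    def coleman_liau_index(text: str) -> float:
        # 0.0588*L - 0.296*S - 15.8, where L = letters per 100 words and
        # S = sentences per 100 words (naive tokenisation for illustration).
        words = text.split()
        letters = sum(ch.isalpha() for ch in text)
        sentences = max(1, sum(text.count(p) for p in ".!?"))
        L = 100.0 * letters / len(words)
        S = 100.0 * sentences / len(words)
        return 0.0588 * L - 0.296 * S - 15.8

    # Hypothetical figures: 12 defective sentences in a 249-sentence document.
    print(round(defect_rate(12, 249), 3))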
QuARS also includes requirements clustering, or View Derivation. This means that collections of requirements can be handled in order to highlight clusters of them that deal with specific properties or topics. These clusters are called views. The derivation of a view from a document uses special sets of terms containing an appropriate corpus that can be put in relation to specific properties. These sets of terms are grouped into V-dictionaries. Requirements clustering can provide support for consistency analysis, traceability definition and verification of the correct organisation of the requirements document. However, the requirements clustering feature of QuARS is not used in the empirical study described in this paper; for additional information see [13].

The input file containing the requirements document to be analysed is given to the syntax parser, which produces a new file containing the parsed version of the text. The tool uses a set of indicator-related dictionaries that contain terms indicating a defect type according to the quality model. The dictionaries are in simple text format. Once the user has selected the type of analysis, the corresponding dictionary is made available and, if necessary, can be tailored. Not all the aspects related to the automatic support of the analysis/evaluation of requirements quality can be addressed in the same way and with the same depth and ease. The indicators are the basic elements to identify defects and collect metrics. Since the indicators are linguistic components, it is necessary to define precise sets of terms and linguistic constructions to be used for indicator detection.

Detecting the various indicators contained in the quality model entails applying different linguistic techniques. In particular, for certain kinds of indicators only a lexical analysis is needed, searching the text of the requirements document for the terms contained in the related dictionary. For other indicators the syntactical structure of the sentences in the text needs to be derived before a defect can be pointed out. Table 1 shows the kind of analysis necessary for detecting each of the indicators of the quality model.

Table 1. Type of Analysis for Detecting Sub-Characteristic-Related Defects

    Indicator             Type of Analysis
    Vagueness             Lexical
    Subjectivity          Lexical
    Optionality           Lexical
    Implicity             Syntactical
    Weakness              Syntactical
    Under-Specification   Syntactical
    Multiplicity          Syntactical
    Readability           Coleman-Liau readability formula[12]

A complete definition of the QuARS indicators is provided in [5]. The lexical analysis of the requirements document is sufficient to detect Vagueness, Optionality and Subjectivity defects. In fact, such defects consist of the occurrence of special defect-revealing terms and phrases in the requirements. To highlight the other indicators, knowledge of the syntactical structure of the sentences is required. In fact, the implicity indicators can be detected only if the subjects and the objects of each sentence are known, the weakness indicator needs the identification of the verbs, the under-specification indicator needs to know the relationship between nouns and modifiers, and the multiplicity indicator needs to know which elements in a sentence are the subjects and which are the verbs. Fig.1 shows the QuARS GUI for the Expressiveness analysis of requirements.

Fig.1. Expressiveness analysis: the QuARS GUI.
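To illustrate the purely lexical side of this scheme, the sketch below flags sentences containing terms from an indicator dictionary, which is how vagueness, optionality and subjectivity defects are described above as being found. The dictionary contents and function names here are assumptions made for illustration rather than the shipped QuARS dictionaries, and the syntax-dependent indicators (implicity, weakness, under-specification, multiplicity) would additionally need the parsed sentence structure.

    # Illustrative sketch, not QuARS code: lexical detection of defect-revealing terms.
    import re

    # Hypothetical vagueness dictionary; the real dictionaries are plain-text
    # files that the analyst can tailor to the application domain.
    VAGUENESS_TERMS = {"adequate", "as appropriate", "clear", "easy", "user-friendly"}

    def lexical_defects(sentences, dictionary):
        """Return (sentence index, matched term) pairs as candidate defects."""
        hits = []
        for i, sentence in enumerate(sentences):
            lowered = sentence.lower()
            for term in sorted(dictionary):
                if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                    hits.append((i, term))
        return hits

    requirements = [
        "The system shall provide an adequate response time.",
        "The operator shall confirm the command within 2 seconds.",
    ]
    print(lexical_defects(requirements, VAGUENESS_TERMS))  # [(0, 'adequate')]

A term match is only a candidate defect: as the empirical study below quantifies, the analyst still has to separate real defects from false positives, which is one reason the dictionaries are kept editable.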
In this case,implemented, the goal of an empirical study is to verifythe technology under investigation is the QuARS tooland, if necessary, to improve its instantiation. In thisto analyse NL requirements described in Section2. Itcase, the objective may be to understand if the tech-is important to classify the type of technology in annology fits in with typical work practices or matchesempirical study because this determines what needs tothe way people typically think about the problem thebe measured in the study itself. In fact, the object oftechnology is addressing. The measurements and datathe study can be either a constructive or an analyticalinvolved are more qualitative and usage-oriented thantechnology. The first type is made up of technologiesquantitative.that are used as part of new systems, or that produceparts of new systems (e.g, reliable communication pro-3.2.4 Verifying Project Usetocols, secure networks, code generators). The lattertype is made up of technologies able to support devel-The verification of the usability and effectiveness ofopers by providing information and data on the systemthe technology when used in a particular environmentunder developing. The measurements that can be em-. is the aim when the 'technology reaches a mature status.pirically derived in the case of constructive technologiesIn this case, access to an industrial environment andare related to the reliability, safety, security, etc. of thethe availability of historical data is required in order tosystem in isolation or integrated with other components,evaluate the benefits the technology may provide withthe type of costs imposed by the technology, or the kindsrespect to the current practice.of support materials that might reduce this cost. Themeasurements that can be empirically derived in the3.3 Testbedcase of support technologies (analytical, in our case) areto do with the soundness (i.e., what degree of correct-The environment where the experiment is conductedness and relevance of problems can be found) and the is another important factor. The testbed is not onlycompleteness (i.e, what is the rate of existing problemsmade up of technical infrastructures used to run the ex-that can be found) of the technology itself.periment but also composed of the subjects that performthe experiment.ned in a laboratory3.2 Goalswhe:中国煤化工e technology would beDiferent studies that consider the same technologyappliTHCNMHGed is easier to control,may be carried out in very diferent ways depending onbut ion the confidence lev-the aims of the study itself. Generally, aims vary ac-els of the evidence collected due to the effects of thecording to the level of maturity of the technology. Insimulations. Laboratory testbeds are suitable for vali-the following the various aims of an empirical study are dation or debugging.342J. Comput. Sci. & Technol, May 2007, Vol.22, No.3Alternatively, the testbed may be the real context between two techniques used to analyse an item, the(or an "industrial size”system of realistic complexity) characteristics of the items used in the various experi-of use of the technology under observation. The ad-ments should be taken into account, to be sure that thevantages of this type of testbed are the high levels ofdiferent nature of the techniques does not afect and in-confidence of the results. However, it may be hard to fuence the outcormes. 
Threats to validity must be madeobtain and it can also be expensive both in terms of theexplicit and are listed below.effort required by the researcher to understand the con-●Internal validity: the internal consistency of thetext and in terms of the dificulty of involving industrialstudy that determines whether the conclusions arepersonnel in such a study.valid.In fact, the subjects involved in the performance of- Consider and discuss possible biases introducedthe study have a strong impact on the results. Novices,when participants are motivated to change theirsuch as students, are usually only suitable for valida-behaviour:tion/ debugging technology experiments because of their- motivation bias;poor experience. The involvement of experienced pro-- author bias.fessionals guarantees more reliable data, but can be very●There are several classes of threats to be analysedexpensive.or controlled to avoid rival hypotheses, for exam-ple:3.4 Measurements- history: did events occur between two measure-ments that are extraneous to the study but thatCollecting measurements is the expected output ofmight explain observed differences?any empirical study. It is important to understand whatwe are going to measurel7]. It is necessary to ensure. maturation: were there likely changes to the sub-jects' performance over time that biased some ofthat the performance measured is really a result of us-ing the technology under study. In fact, the validitythe measurements?- testing: could the experience of being measuredof the measurements collected depends on the way the(for data collection) influence the subjects' per-process is conducted, in particular, the process observedformance during a later measurement?needs to be centred on the technology of interest and notinfuenced by how it is usually conducted.. instrumentation: were there changes in the waythe measurements were collected that might haveThe cost of using the technology should always becaused different results?considered in the measurements. The possible benefts- selection: was there any potential bias in the wayof the application of the technology under observationsubjects were assigned to different groups?are dependent on its cost. The costs related to the use- experiment mortality: if subjects are allowed toof a technology have to be identifed (including the hid-drop out of the study, are certain types of sub-den costs, e.g, the costs relating to human motivation,jects more inclined to do so than others?or those related to the lack of integration with existing- interaction effects: any of the above threats mayprocesses).interact in a way to produce an effect for certainThe impact of a technology on the system (i.e, thebenefts) may be measured from two stand points: thesubsets of subjects that may bias the results.researcher's point of view (i.e, the measurements are●Construct validity: the mapping between the spegeared to evaluating the extent to which the technologycific measures collected and the underlying con-performs according to the specifcations and solves thecepts under study. Do the collected data reallyidealised problems) and the project point of view (i.e,measure the factors of importance?the real impact the technology has for the users and the●External validity: the ability of the study's conclu-extent to which it solves the real-world problems). Insions to be generalised to some real-world contextthe first case the“what”to measure should be derivedof interest. 
Can the results of the study be generalised to the population of interest?

4 QuARS Empirical Study

The aim of the study described in this paper is to evaluate the effects of the use of QuARS for analysing NL requirements in an industrial project. To perform the study we obtained the involvement of a company. The organisational unit involved in the study is part of a global company working in the telecommunications field. The company performs the analysis of NL requirement documents by assigning external consultants to review the documents in order to identify and report defects. The company collects performance data on the requirements analysis process, such as the effort required and the number of defects found. QuARS was then run on the same documents that were reviewed by the consultants. The outcomes of the application of QuARS, in terms of defects found and effort required, were compared with the analysis reports produced by the consultants in order to evaluate the possible added value of using QuARS.

The data gathered are shown in Section 5, whereas this section provides the methodological details of the experiment.

As in any empirical study, the first step is the statement of a set of hypotheses to be verified by means of the experiment[18]. The hypotheses stated are:
H.1. QuARS is more cost effective than human inspection;
H.2. QuARS enhances the quality of the requirements with respect to human inspection;
H.3. QuARS can be used to complement the human inspection;
H.4. The rate of false positives QuARS produces affects its cost effectiveness.

In order to make the previous hypotheses verifiable, we transformed them into more specific hypotheses that can be expressed as simple logical statements:
H.1. → The average effort per defect found with QuARS is less than that with human inspection.
H.2. → The number of defects detected with QuARS is higher than that with human inspection.
H.3. → The symmetric difference of the set of defects found by QuARS and the set of those found by the human inspector is not null.
H.4. → The effort to identify false positives is higher than that to identify actual defects.

To measure the stated hypotheses, some metrics have been defined. In Table 2 these metrics are described and associated with the hypotheses.
Table 2. Metrics Defined for the Empirical Study

H.1. — M1 = (NDFQ / MHCQ) / (NDFH / MHCH)
M1 represents the ratio between the effectiveness (in terms of defects found per time unit) of the analysis made with QuARS and that made by the human reviewer. M1 ≈ 1 indicates that the effectiveness of the two approaches is comparable, M1 << 1 indicates that human review is more effective than QuARS, and M1 >> 1 indicates the opposite.

H.2. — M2 = NDFQ / NDFH
M2 indicates the ratio between the number of defects found using QuARS and by the human reviewer. M2 ≈ 1 indicates that the two approaches have a similar impact on the quality of the NL requirements, M2 << 1 indicates that the human inspection is able to enhance the quality more than QuARS, and M2 >> 1 indicates the opposite.

H.3. — M3 = NDF(Q and H) / NDF(Q or H)
M3 indicates the ratio of defects found both by the human reviewer and QuARS with respect to the total amount of defects found by the two approaches together. M3 ≈ 1 indicates that QuARS can replace a human reviewer, M3 << 1 indicates that the use of QuARS is complementary to reviewing by humans.

H.4. — M4 = (NFPQ / MHCFP) / (NDFQ / MHCQ)
M4 indicates the ratio of the mean effort to detect a false positive and the mean effort to detect an actual defect. M4 ≈ 1 indicates that the cost of detecting a false positive is comparable with that of detecting an actual defect, M4 << 1 indicates that the cost of detecting a false positive is greater than that of detecting an actual defect, and M4 >> 1 indicates the opposite.

Notes: NDFQ: number of defects found by QuARS (false positives excluded); NDFH: number of defects found by the human reviewer; NDF(Q and H): number of defects found both by the human reviewer and by QuARS; NDF(Q or H): size of the total set of defects found by the human inspector and by QuARS (false positives excluded); MHCH: man-hours consumed by the human review; MHCQ: man-hours consumed by running QuARS; MHCFP: man-hours consumed looking for false positives; NFPQ: number of false positives found by QuARS.
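Read as formulas, the four metrics are simple ratios of the counts and man-hour figures defined in the notes above. The sketch below spells them out; the function names are assumptions, and the example call uses only the aggregate figures reported later in the paper (560 real defects and 179 false positives found with QuARS in 2 and 4.25 man-hours respectively).

    # Illustrative sketch of the study metrics (assumed function names).

    def m1(ndf_q, mhc_q, ndf_h, mhc_h):
        # Defects found per man-hour with QuARS vs. with the human review.
        return (ndf_q / mhc_q) / (ndf_h / mhc_h)

    def m2(ndf_q, ndf_h):
        # Defects found with QuARS vs. defects found by the human reviewer.
        return ndf_q / ndf_h

    def m3(ndf_q_and_h, ndf_q_or_h):
        # Shared defects over the union of the two defect sets.
        return ndf_q_and_h / ndf_q_or_h

    def m4(nfp_q, mhc_fp, ndf_q, mhc_q):
        # False positives screened per man-hour vs. real defects found per man-hour.
        return (nfp_q / mhc_fp) / (ndf_q / mhc_q)

    # Aggregates reported in Section 6: 560 real defects in 2 man-hours of tool
    # runs, 179 false positives screened in 4.25 man-hours.
    print(round(m4(nfp_q=179, mhc_fp=4.25, ndf_q=560, mhc_q=2), 2))  # 0.15

The result matches the Total row of Table 7 (M4 = 0.15).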
The experiment was conducted in two separate phases: the first addressed hypotheses H.1, H.2 and H.4, the second hypothesis H.3.

In the first phase QuARS was used at the company by a senior engineer, who collected data on the performance of the analysis with QuARS. The data were the number of defects he found, the number of false positives detected by QuARS, and the effort needed (including training and false-positive detection effort) in terms of hours spent. To collect empirical evidence, six different requirements documents were used, as shown in Table 3. These documents were taken from real past projects where human inspection was conducted.

The second phase was conducted by the authors of this paper on the basis of a review report produced by the human reviewer reporting the defects found during the review of a requirements document containing 1044 sentences (hereafter D-7). The report describes all the defects found and classifies how critical they were (i.e., their potential level of danger). The authors analysed the defects reported by the human inspector to work out whether each of them could or could not be detected by QuARS.

5 Evidence

In this section the results of the experiments are shown along with the calculated metrics values.

Table 3. Characteristics of the NL Requirements Documents

    Document ID   Source Document Format   Number of Sentences
    D-1           .doc                     249
    D-2           .pdf                     314
    D-3           —                        1642
    D-4           .html                    177
    D-5           —                        36
    D-6           .html                    —
    (Cells marked "—" are illegible in this copy of the paper.)

Table 4 shows the measurements related to the effort required to perform the analysis of the six documents, indicated as D-1, ..., D-6, by human inspection and by QuARS. The effort related to the analysis performed using QuARS is broken down into the effort to run the tool for the analysis, including the training effort (QuARS column), and the effort spent detecting false positives (FP column).

Table 4. Effort Required Using QuARS vs. Human Inspection
(The per-document effort figures are illegible in this copy; over D-1~D-6 the QuARS analysis took 2 man-hours of tool runs plus 4.25 man-hours of false-positive screening, i.e., 6.25 man-hours in total.)

Table 5 shows the results in terms of the number of defects found by the human reviewer and by QuARS. The data in the "Number of Defects Found" column include the number of false positives pointed out by the tool. The number of actual defects found by QuARS is reported in the "Real Defects" column, while that of false positives is reported in the "False Positives" column.

Table 5. Number of Defects Found by QuARS vs. Human Inspection
(The per-document counts are illegible in this copy; over D-1~D-6 QuARS found 560 real defects and 179 false positives.)

Table 6 reports the data collected in the second phase of the study. The defects found by the human review are classified into three classes:
• Class Q: those defects that can be found by QuARS as well. These kinds of defects are expressiveness-related defects, according to the terminology used in Section 2;
• Class NS: those defects that are out of the scope of QuARS because they deal with semantic issues;
• Class NE: those defects dealing with editorial issues (e.g., missing or wrong formatting of the text) that cannot be found by QuARS.

The defects reported by the human reviewer during the analysis of the document D-7 have been classified, and the size of each class is reported in Table 6. Table 6 also contains the number of defects belonging to each criticality level (H, M and L), according to the rating made by the human reviewer.

Table 6. Classification of the Defects Found by the Human Inspector on D-7
(The individual counts and percentages in this table are illegible in this copy.)

The data reported in this section have been used to calculate the metrics described in Table 2. In Table 7 the values in the columns M1, M2, M3 and M4 refer to the values of the corresponding metrics; for D-7 (second phase) only M3 applies.

Table 7. Metrics Values from the Empirical Study

    Document ID       M1       M2     M4
    D-1               29.33    7.33   0.24
    D-2               1.09     0.82   0.19
    D-3               209.56   8.94   0.05
    D-4               17.10    1.90   0.63
    D-5               13.04    1.86   0.17
    D-6               56.36    5.64   0.13
    Total (D-1~D-6)   32.11    3.24   0.15

    D-7               M3 = 0.37

6 Interpretation of the Results

The following conclusions can be drawn from the data reported in Section 5 and the calculation of the metrics:

1) M1 indicates that QuARS is much more effective than the human review. The average value shows that the number of defects detectable with QuARS per hour is 32.11 times higher than that by human review. However, the corresponding values for the individual documents vary from 1.09 (D-2) to 209.56 (D-3). The effectiveness of using QuARS would thus seem to depend on the nature and contents of the documents as well as their size.

2) M2 indicates that QuARS is able to find more defects than the human reviewer, irrespective of the time required.
The average value shows that the number of defects detectable with QuARS is about three times higher than that by humans. The values for the individual documents vary from 0.82 (D-2) to 8.94 (D-3); in this case, too, the nature of the document may be the cause of the significant variations in the value of this metric.

3) M3 indicates that QuARS complements the human review process. In fact, about two out of three (63%) of the defects detected by the human reviewer cannot be detected by QuARS, because the class of defect is not in the scope of the tool. This indicates that the human review cannot be completely replaced by a tool like QuARS. Moreover, if the data in Table 6 are cross-checked with those in Table 5, it can be observed that the capability of QuARS to detect class-Q defects is about ten times higher than that of the human reviewer. In fact, since about one third of the defects detected by the human reviewer belong to class Q (Table 6), and the total number of defects found by QuARS is 3.24 times higher than that found by the human reviewer (Table 5), the capability of finding class-Q defects is thus about ten times higher for QuARS.

4) M4 indicates that the mean cost (in terms of effort required) of detecting a false positive is about 6.5 times higher than the cost of identifying a real defect (179 false positives found in 4.25 hours vs. 560 defects detected in 2 hours). Thus, if the ratio between the false positives and the real defects is higher than 5, QuARS is less effective than the human review (this value has been calculated considering that, as we saw in point 1 above, the effectiveness is 32 times higher for QuARS). False positives are thus potentially dangerous for the effectiveness of QuARS. Since the efficacy of the tool in finding defects may depend on the adequacy of the QuARS dictionaries to the context, tailoring the dictionaries is critical. The study did not allow for the user's learning curve, but the user did report that he was able to use the tool and handle the false positives in a significantly better manner after the first four hours of use. For this reason, experience in using QuARS may reduce the impact of removing false positives on the effectiveness of the QuARS approach.

5) Table 6 also shows that the level of criticality of the defects detected by QuARS is comparable with that of the defects found by the human reviewer. In fact, QuARS found 46% of the defects evaluated as highly or medium critical, while this percentage is 39% for the human reviewer. This means that the "severity" of the defects found by QuARS and those found by the human reviewer is substantially the same.

7 Validity of the Results

In this section the list of key criteria for the performance of an empirical study discussed in Section 3 is considered again and compared against the specific nature of the empirical study. The aim is to validate the empirical study from a methodological point of view. If the methodology adopted is sound, the conclusions derived can be considered significant.

Type of Technology. QuARS was conceived as a natural language requirements checker; it thus belongs to the category of analysis technologies. The measures derived will be considered, according to Subsection 3.1, only as addressing the soundness and the completeness properties of NL requirements analysis. The results of the study are used to estimate the effectiveness of the application of QuARS in the requirements analysis process and of the results produced by QuARS. The conclusions regarding QuARS only deal with the extent to which the defects pointed out by QuARS are real and relevant (i.e., the soundness) and with which quality characteristics of NL requirements the tool is able to address (i.e., the completeness).

Goals. The purpose of this empirical study is to verify the use of QuARS. The goals of an empirical study depend on the maturity of the technology under investigation. The QuARS trials performed during the evolution of the tool fulfilled the first three goals listed in Subsection 3.2. In particular, the goal of observing phenomena (i.e., achieving an understanding of the real world, which in our case means understanding the needs for quality analysis of NL requirements) was achieved by means of interviews with practitioners, by studying the literature, and through the authors' experience in conducting software process assessments according to the ISO 15504 standard[10]. The validation of the idea was obtained by using the first prototypes of the tool on sample documents[9].
De-potentially dangerous for the efectiveness of QuARS.bugging was done via trials made using real industrialSince the eficacy of the tool in finding defects may de-require中国煤化工I versions of the toolpend on the adequacy-to-the-context of the QuARS dic- sent tqtionaries, tailoring the dictionaries is critical. Yet, theThYHC N M H Genough to be empir-study did not allow for the user's learning curve but the ically evalduea Lo verly 1Is use山project.user did report that he was able to use the tool andTestbed. The technical infrastructure used for thehandle the false positives in a significantly better man-empirical study (the QuARS tool) is independent of thener after the first four hours of use. For this reason,context where it is run, i.e., the effect of using QuARS in346J. Comput. Sci. & Technol, May 2007, Vol.22, No.3the laboratory is the same as its use in a real industrialhuman review. Errors belonging to this class can causecontext. However, the subject performing the analysis serious problems in the software development as high-with the tool has a very important impact on the ex-lighted by the human inspector's opinions in the caseperiment. In fact, in our case the analysis with QuARSexamined in this paper. Reviewing requirements doc-was made by a chief engineer involved in the specif-uments by humans is a costly activity requiring exper-cation and development of the product rather than atise in the application domain as well as technical skills.researcher, because the engineer has the skills needed toThis paper has addressed the question of how to im-describe, evaluate and classify the defects found. This prove the review of NL requirements both in terms ofguarantees the correct identification of false positivesefficiency and effectiveness. We have described an em-among the QuARS outcomes. Moreover, the results ob-pirical study aiming at assessing the benefits of using antained with QuARS are compared with the results ofautomatic tool in the requirements analysis process.the analysis made by an expert human reviewer. TheseThe results of the empirical study presented in thischaracteristics ensure the validity of the adopted testbedpaper have shown that using an automatic tool (calledfor the purposes of the empirical study.QuARS and developed at the ISTI-CNR) which ad-Measurements. In such an experiment the simpledresses linguistic ambiguity can significantly improvecount of defects detected is not suficiently significant the requirements analysis process. In fact, the bene-because the cost of technology (including the trainingfits in terms of quantity and cost effectiveness of defectseffort) as well as the nature of the outputs produced (infound have been observed and compared to human in-this case the number of false positives) may affect the spection. However, human inspection cannot be fullyvalidity of the overall results. The cost of the technologyreplaced by a tool like QuARS since humans can pointis considered and the number of false positives is takenout semantic deficiencies that QuARS is not able for theinto account for the measurements reported and thismoment to address.guarantees the soundness of the measurements made inThe analysis of the results of the empirical studythe experiment.also highlights some critical points. In particular, theInternal Validity. 
Internal Validity. There was no reason for the persons involved to change their behaviour during the experiment, nor do any of the other threats discussed in Subsection 3.5 have relevance in the performed empirical studies. In fact, the senior engineer who used QuARS had no interest in preferring one method over another, and the human reviewer was not directly involved in the study because he performed the requirements inspection prior to the experiment (he did not know anything about the experiment). Construct validity is guaranteed for phase 1 of the experiment because the data collected and analysed are objective and thus exclude any prejudice or bias. Some concerns exist for the validity of phase 2. The validity of the assessment of hypothesis H.3 can be affected by the fact that only one document (even though it was large) was used. Points 3 and 5 in Section 6 can thus be considered as having a lower confidence level than the others. Finally, the external validity of this empirical experiment is guaranteed because there is no factor that would affect whether the results of the study can be generalised to the population of interest. In fact, the profile of the persons involved in the study makes the results compliant with a real working scenario.

8 Conclusion

The analysis of requirements by means of human inspections is a common practice in the software industry. This kind of analysis highlights different classes of defects, ranging from mere lexical errors to consistency problems such as the presence of contradictory requirements. Problems due to poor writing style (e.g., ambiguity, difficult readability) are also addressed in a human review. Errors belonging to this class can cause serious problems in software development, as highlighted by the human inspector's opinions in the case examined in this paper. Reviewing requirements documents by humans is a costly activity requiring expertise in the application domain as well as technical skills. This paper has addressed the question of how to improve the review of NL requirements both in terms of efficiency and effectiveness. We have described an empirical study aimed at assessing the benefits of using an automatic tool in the requirements analysis process.

The results of the empirical study presented in this paper have shown that using an automatic tool (called QuARS and developed at the ISTI-CNR) which addresses linguistic ambiguity can significantly improve the requirements analysis process. In fact, the benefits in terms of quantity and cost effectiveness of defects found have been observed and compared to human inspection. However, human inspection cannot be fully replaced by a tool like QuARS, since humans can point out semantic deficiencies that QuARS is not able, for the moment, to address.

The analysis of the results of the empirical study also highlights some critical points. In particular, the false-positive phenomenon (i.e., the presence of defects pointed out by the tool but not considered real defects by the analyst) cannot be avoided with a technology like QuARS, but it can be mitigated by tailoring the QuARS dictionaries.

One way to integrate the tool into the natural language requirements analysis process is to use it before the human review in order to reduce the effort the reviewer spends on lexical and syntactic defects. The human inspector can then focus on the semantic and pragmatic considerations of ambiguity and requirements correctness.

Acknowledgements  Special thanks to Fabrizio Fabbrini, Mario Fusani and Stefania Gnesi of the ISTI-CNR for their contribution in the analysis of the data gathered in the empirical study.

References

[1] Gray J, Schach S. Constraint animation using an object-oriented declarative language. In Proc. the 38th Annual ACM SE Conference, Clemson, SC, April 7~8, 2000, pp.1~10.
[2] Kamsties E, Berry D M, Paech B. Detecting ambiguities in requirements documents using inspections. In Proc. First Workshop on Inspection in Software Engineering (WISE'01), Paris, France, July 23, 2001, Lawford M, Parnas D L (eds.), Software Quality Research Lab at McMaster University, Canada, pp.68~80.
[3] Sommerville I, Sawyer P. Requirements Engineering: A Good Practice Guide. John Wiley & Sons, 1997.
[4] (Entry illegible in this copy.)
[5] (Entry illegible in this copy.)
[6] Lami G. QuARS: A tool for analyzing requirements. Software Engineering Technical Report CMU/SEI-2005-TR-014, Software Engineering Institute, USA, September 2005.
[7] Wasson K S. Requirements metrics: Scaling up. In Proc. 2nd International Workshop on Comparative Evaluation in Requirements Engineering (CERE'04), Kyoto, Japan, September 2004, Gervasi V, Zowghi D, Sim S E (eds.), FIT-UTS, Sydney, ISBN 1-86365-866-1, pp.51~55.
[8] Seaman C B. Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering, 1999, 25(4): 557~572.
[9] Fabbrini F, Fusani M, Gnesi S, Lami G. The linguistic approach to the natural language requirements quality: Benefits of the use of an automatic tool. In Proc. 26th Annual IEEE Computer Society - NASA Goddard Space Flight Center Software Engineering Workshop, Greenbelt, MD, USA, November 27~29, 2001, pp.97~105.
[10] Information Technology - Software Process Assessment. ISO/IEC TR 15504:1998, pp.1~9.
[11] Gnesi S, Lami G, Trentanni G, Fabbrini F, Fusani M. An automatic tool for the analysis of natural language requirements. International Journal of Computer Systems Science and Engineering, Special Issue on Automated Tools for Requirements Engineering, 2005, 20(1): 53~62.
[12] Coleman M, Liau T L. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 1975, 60(2): 283~284.
[13] Lami G, Trentanni G. An automatic tool for improving the quality of software requirements. ERCIM News, 2004, 58: 18~19.
[14] Kitchenham B, Pickard L, Pfleeger S L. Case studies for method and tool evaluation. IEEE Software, 1995, 12(4): 52~62.
[15] Kitchenham B. Evaluating software engineering methods and tools, part 1: The evaluation context and evaluation methods. ACM SIGSOFT Software Engineering Notes, Jan. 1996, 21(1): 11~14.
[16] Wohlin C et al. Experimentation in Software Engineering: An Introduction. International Series in Software Engineering, Vol.6, Springer, 2000.
[17] Fenton N. Software Metrics: A Rigorous Approach. London: Chapman and Hall, 1991.
[18] Basili V R, Rombach H D. The TAME project: Towards improvement-oriented software environments. IEEE Transactions on Software Engineering, 1988, 14(6): 758~773.

Giuseppe Lami received his M.Sc. degree in computer science and Ph.D. degree in information engineering from the University of Pisa, Italy, in 1995 and 2004 respectively. He joined the I.S.T.I.-C.N.R. in 2000 after five years in industry. He has been a resident affiliate of the Software Engineering Institute since 2003. His research activity is concentrated on requirements engineering and software quality evaluation. He has performed several software process assessments (according to the ISO/IEC 15504 standard) in the automotive software industry. Since 1998 he has been a lecturer at the University of Pisa and at the University of Florence.

Robert W. Ferguson is an experienced software developer and project manager who joined the SEI after 30 years in industry. His experience includes development and project management in real-time flight, manufacturing control systems, large databases, and systems integration projects. He has also led process improvement activities at two companies and has significant experience in implementing measurement programs. Mr. Ferguson is a Senior Member of the IEEE and has certification from the Project Management Institute (PMI) as a Project Management Professional (PMP).
