The second frequently used metric is speedup, which captures the relative benefit of solving a given problem using a parallel system. There are different speedup definitions. In general, speedup is defined as the ratio of the time needed to solve the problem on a single processor to the time required to solve the same problem on a parallel system with p processors. Depending on the way in which the sequential time is measured, we can distinguish absolute, real and relative speedups. In this paper the relative speedup is used, with the sequential time defined as the time of executing the parallel program on one of the processors of the parallel computer. Theoretically, speedup cannot exceed the number of processors used during program execution; however, different speedup anomalies can be observed. In theory, a speedup anomaly occurs when the program does not execute in the way predicted by the performance model. Specifically, the speedup can be larger than the number of available processors, the so-called superlinear speedup. There are three different reasons for this phenomenon:

– algorithm-dependent anomaly, caused by the internal algorithm structure, in other words the way the problem is parallelized; for example, in some parallel search algorithms, if a search tree contains solutions at different depths, then after distribution of this tree among different processors the solution can be found by exploring a smaller number of nodes,

– hardware-dependent anomaly, caused by specific hardware features, for example the interconnection network, the size of the cache, internal memory, etc.; the "cache effect" is an example of this anomaly: when a program is executed on a large number of processors, the per-processor problem size is reduced and all needed data can be placed in the relatively faster cache memory, which reduces the execution time and can raise the efficiency even above 1,

– execution-environment-dependent anomaly, caused by the execution environment, mainly operating system features such as the scheduling system; for example, when processors with different performance are used during parallel program execution.

The performance metrics presented so far do not take into account the utilization of processors in the parallel system. The next metric, the efficiency of a parallel program, is defined as the ratio of speedup to the number of processors. In an ideal parallel system the efficiency is equal to one. In practice the efficiency of parallel systems is between zero and one; however, when execution anomalies occur, the efficiency can be greater than one. Another useful metric is the scalability of the parallel system [1], [2]. In general, it is a measure of a system's capability to increase speedup in proportion to the number of processors. There are two ways in which scalability analysis can be carried out: with fixed and with scaled problem size. The fixed problem size scalability analysis answers the question: what is the fastest I can solve problem A on computer X [1]. In this case different (mainly hardware-dependent) execution time anomalies can occur. In their presence the second approach, scaled problem size analysis, can be used. In this case it is checked whether the efficiency can be kept at the same level when the problem size and the number of processors are increased concurrently. One can say that a system is scalable when the efficiency stays the same for an increasing number of processors and an increasing problem size. In this paper the influence of these two approaches on existing speedup anomalies is shown. The last metric used in the paper is the computational throughput, defined as the amount of data processed by a single processor in a unit of time. During the throughput analysis, only the part of the data that is distributed between the processors, whose size directly depends on their number, can be taken into consideration, provided the remaining data represents a constant time factor.
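As a minimal illustration of these definitions (a sketch, not code from the paper; the run times used are hypothetical placeholders), the three metrics can be computed directly from measured timings:

```python
# A minimal sketch of the metrics defined above: relative speedup,
# efficiency and computational throughput.  The timings in the example
# are hypothetical placeholders, not values measured in the paper.

def speedup(t_seq: float, t_par: float) -> float:
    # Relative speedup: time of the parallel program on one processor
    # divided by its time on p processors.
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    # Speedup divided by the number of processors; values above 1 signal
    # an anomaly such as the superlinear speedup discussed above.
    return speedup(t_seq, t_par) / p

def throughput(data_bytes: float, t_par: float, p: int) -> float:
    # Amount of data processed by a single processor in a unit of time.
    return (data_bytes / p) / t_par

if __name__ == "__main__":
    t1, t8 = 120.0, 13.5      # hypothetical run times for p = 1 and p = 8
    print(f"speedup    = {speedup(t1, t8):.2f}")        # 8.89 > 8
    print(f"efficiency = {efficiency(t1, t8, 8):.2f}")  # > 1: superlinear
```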
3 Parallelization of the SOM Algorithm
The Kohonen Self-Organizing Maps (SOM) [3] are commonly employed to process large input data, but their effective working abilities can be achieved only after a time-consuming process of learning. The hardware requirements for using even moderate-size networks can easily exceed available resources. Parallel algorithms offer a solution to this problem. They divide the computations between many independently working small computers or utilize the whole computational power of larger, multiprocessor machines. While offering the possibility to reduce the time needed to create the resulting network, to work on bigger data samples, or to prepare more detailed results, they preserve important properties of sequential SOM implementations.
In the original sequential algorithm the winner search and the modification of the winner's neighbours' weights are performed on-line in every learning step. The learning parameters are updated only once after each presentation of all the learning set vectors (at every epoch). The parallel algorithm variants are generally based on division of either the learning set (learning set parallelization) or the network (network parallelization) between the processors [7]. The speedup of parallel implementations utilizing the on-line algorithm is limited by the frequent communication. To lower the communication overhead, the off-line algorithm (with the weight updates performed once at the end of each epoch) can be employed.
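The difference between the two update schemes can be sketched as follows; this is an illustrative simplification, not the paper's implementation, and the neighbourhood helper `neigh` is a hypothetical placeholder:

```python
import numpy as np

# Illustrative sketch of the on-line versus off-line update schemes.
# `weights` has shape (n, d), `data` has shape (N, d); `neigh(w, e)` is a
# hypothetical helper returning the indices and neighbourhood factors of
# the winner's surroundings in epoch e.

def online_epoch(weights, data, lr, neigh, e):
    # On-line: weights change immediately after every winner search, so a
    # parallel implementation must communicate at every learning step.
    for x in data:
        w = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # winner
        idx, h = neigh(w, e)
        weights[idx] += lr * h[:, None] * (x - weights[idx])

def offline_epoch(weights, data, lr, neigh, e):
    # Off-line: only the winners are recorded during the epoch; all weight
    # modifications are applied once at its end, leaving a single
    # communication phase per epoch in the parallel variant.
    winners = [int(np.argmin(np.linalg.norm(weights - x, axis=1)))
               for x in data]
    for x, w in zip(data, winners):
        idx, h = neigh(w, e)
        weights[idx] += lr * h[:, None] * (x - weights[idx])
```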
4 Theoretical Performance Model
To analyze the performance of the algorithm, its performance model should be constructed. The paragraphs below present the performance model of the single off-line algorithm chosen for further evaluation: the network off-line parallelization [4].

In this algorithm the winner search procedure is divided between p processors by instructing each of them to find local winners in a p times smaller part of the network (the mini-network), approximately reducing the search time by a factor of p. The winner position is determined and remembered in every step, and the information transfer followed by the neighbours' weight modification is performed at the end of every epoch.
The execution time of the parallel algorithm depends on the communication time and the time of computations performed on each of the utilized processors. In the network off-line parallelization algorithm, the computation time on a single processor consists of searching for the winning neuron and modifying the weights of the winner and its surroundings. The time of the winner search operations depends linearly on the number of neurons. The number of modifications depends on the epoch index. The radius of the surroundings changes in accordance with the formula r(e) = S_a/(S_b + e), where S_a, S_b are a priori selected constants used to control the algorithm behavior and e is the index of the current epoch. It can be assumed that for a typical algorithm execution every neuron has an equal probability of being the winner. The average number of neurons in the winner's surroundings s(e, n) depends on the radius of the surroundings r(e) and the network size n. The precise number of neurons in the winner's surroundings depends on the position of the winner in the network and on the network size, but under the assumption of an equal probability of being the winner among the neurons, s(e, n) has an upper bound of (2r(e) + 1)^2. For a number of learning set elements equal to N, it can be proved that the number of neurons modified during E epochs is described by the equation

S(e, n) = N \sum_{e=0}^{E-1} s(e, n).

The shape of the S(e, n) function and the experimental results for the execution parameters utilized for the experimental evaluation (S_a = 16, S_b = 2, E = 16, N = 1152) are presented in fig. 1.
Fig. 1. Number of modifications in network off-line parallelization
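The model can be evaluated numerically. A sketch (illustrative, not the paper's code) using the upper bound (2r(e) + 1)^2 in place of s(e, n) and the parameter values quoted above:

```python
# Sketch of the modification-count model with the parameters used for the
# experimental evaluation: Sa = 16, Sb = 2, E = 16, N = 1152.  The upper
# bound (2r(e) + 1)^2 stands in for s(e, n), so the counts are estimates.

Sa, Sb, E, N = 16, 2, 16, 1152

def r(e):
    # Radius of the winner's surroundings in epoch e.
    return Sa / (Sb + e)

def s_bound(e):
    # Upper bound on the number of neurons in the winner's surroundings.
    return (2 * r(e) + 1) ** 2

def S(E, N):
    # Number of neurons modified during E epochs: N * sum_{e=0}^{E-1} s(e, n).
    return N * sum(s_bound(e) for e in range(E))

if __name__ == "__main__":
    for e in range(E):
        print(f"epoch {e:2d}: r(e) = {r(e):5.2f}, s(e) <= {s_bound(e):6.1f}")
    print(f"S(E, n) <= {S(E, N):,.0f}")
```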
The runtime of the sequential SOM algorithm T_seq depends on the number of neurons n, the time of the elementary search operation τ_c, the number of learning set elements N and the number of epochs E. If the time of modifying a single neuron's weights is τ_sm and the number of neurons in the winner's surroundings is s(e), the time t_mod during which the weights of the winner and its surrounding neurons are modified is given by the equation

t_mod = τ_sm \sum_{e=0}^{E} s(e).

The resulting equation describing the execution time reads T_seq = n τ_c N E + t_mod N.
In the parallel version of the algorithm, the network off-line parallelization, the communication phase is present after each epoch. Before starting communication, each processor computes and stores the vector of winners found for each of the learning set elements. In the communication phase this vector is distributed between the processors using a reduction operation, resulting in the determination of one global winner for every learning set element. After this phase each processor modifies the weights of the winner and its surroundings if they are in its mini-network. The parallel runtime T_p for this algorithm is given by the equation

T_p = (E n τ_c N / p + E t_mod N / p) + E t_r(N, p),

where t_r(N, p) is the time of the communication with reduction for N winners and p processors. The shape of the curve representing the algorithm's parallel runtime is presented in fig. 2.
Fig. 2. Network off-line algorithm execution time
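The reduction step of the communication phase can be sketched as follows. The paper does not name a communication library, so mpi4py and an elementwise MPI.MIN reduction are assumptions here, and the local winner distances are random placeholders:

```python
from mpi4py import MPI   # assumed library; the paper does not specify one
import numpy as np

# Sketch of the per-epoch communication phase: every processor holds, for
# each learning set element, the best (minimum) distance found in its own
# mini-network; an elementwise MIN reduction yields the global winner.

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

N = 1152                          # learning set size used in the paper
local_best = np.random.rand(N)    # placeholder local-winner distances

global_best = np.empty_like(local_best)
comm.Allreduce(local_best, global_best, op=MPI.MIN)

# A processor owns the global winner for element i iff its local distance
# equals the global minimum; only then does it modify the winner and its
# surroundings inside its mini-network (ties would need an extra rule).
owns = local_best == global_best
```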
The number of operations needed for the winner search is constant with respect to the number of processors. With the growth of the number of processors, the number of winners in the mini-network of a single processor decreases. When the communication overhead is small (i.e. for a small number of processors), this decrease results in a decrease of the overall execution time. After the execution time reaches its minimum, the rising communication overhead starts to result in execution time growth.
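This behavior can be reproduced from the runtime model of section 4. In the sketch below the constants τ_c and τ_sm and the reduction-time model t_r are illustrative assumptions, not measured values, so only the qualitative shape (a minimum followed by growth) is meaningful:

```python
# Sketch evaluating T_seq and T_p from section 4 over a range of processor
# counts.  tau_c, tau_sm and the reduction-cost model t_r are assumed,
# illustrative values, not constants measured in the paper.

Sa, Sb, E, N, n = 16, 2, 16, 1152, 48 * 48

tau_c, tau_sm = 2e-8, 2e-7        # assumed per-operation times [s]

def t_r(N, p):
    # Assumed reduction time, growing linearly with the processor count.
    return 5e-7 * N * (p - 1)

def s(e):
    return (2 * (Sa / (Sb + e)) + 1) ** 2

t_mod = tau_sm * sum(s(e) for e in range(E + 1))   # sum_{e=0}^{E} s(e)

T_seq = n * tau_c * N * E + t_mod * N

def T_p(p):
    return (E * n * tau_c * N / p + E * t_mod * N / p) + E * t_r(N, p)

if __name__ == "__main__":
    print(f"T_seq = {T_seq:.3f} s")
    base = T_p(1)                 # relative-speedup baseline (one processor)
    for p in (1, 2, 4, 8, 16, 32):
        print(f"p = {p:2d}: T_p = {T_p(p):7.3f} s, "
              f"relative speedup = {base / T_p(p):5.2f}")
```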
5 Experimental Evaluation of the Parallel SOM Algorithm
To confirm the correctness of the theoretical analysis presented in section 4, a series of experiments was performed. The tests were executed on the Cumulus cluster [5] consisting of 36 homogeneous space-shared IBM PC nodes (Sempron 1.7 GHz, 512 MB RAM). The results presented in the paper are based on the values from series of at least five test runs. If not stated differently, the averaged values are presented.

In the experiments based on the fixed problem size technique, for large and moderate SOM network sizes (48x48 and above) the characteristics of the application efficiency closely matched the theoretical predictions. For smaller network sizes superlinear speedup appeared (fig. 3). The parallel execution efficiency largely exceeding 1 suggests that for numbers of nodes equal to and larger than 8, the utilization of the processor cache speeds up the memory access operations. In the tested algorithm, the network off-line parallelization, with the growing number of utilized processors the network size allocated to one processor gets proportionally smaller. In the algorithm implementation a single neuron is represented by a vector of 617 8-byte long fields. The utilized CPUs have 128 kB L1 and 256 kB L2 caches. An analysis of the network size values gathered in table 1 shows that for the 24x24 network the allocated memory gets close to the cache size when the number of processors reaches 8. For the 48x48 network this happens for 32 processors (hence the final efficiency rise). When 96x96 and larger networks are used, even for the largest tested number of processors the network size is too big to cause cache-related efficiency growth.
Fig. 3. Efficiency of the parallel SOM algorithm

Table 1. Network size in kB per processor

No. of proc.        1      8     32
Net. size 48x48  11106   1388    347
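The per-processor footprint behind table 1 can be recomputed from the neuron representation given above; a sketch (the cache comparison threshold is an illustrative choice):

```python
# Sketch recomputing the per-processor network size of table 1 from the
# neuron representation stated above: 617 fields of 8 bytes per neuron.
# The combined L1 + L2 cache of the utilized CPUs is 128 kB + 256 kB.

BYTES_PER_NEURON = 617 * 8        # 4936 bytes
CACHE_KB = 128 + 256              # L1 + L2 per node

def net_kb_per_proc(side: int, p: int) -> float:
    # Memory occupied by the mini-network allocated to one of p processors.
    return side * side * BYTES_PER_NEURON / p / 1024

if __name__ == "__main__":
    for side in (24, 48, 96):
        for p in (1, 8, 32):
            kb = net_kb_per_proc(side, p)
            note = "  <- at or below the cache size" if kb <= 1.1 * CACHE_KB else ""
            print(f"{side}x{side} network, p = {p:2d}: {kb:8.0f} kB{note}")
```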
Fig. 4. Execution efficiency in relation to the number of processors

Fig. 5. Execution efficiency in relation to the data size allocated on a single processor

Fig. 6. Throughput in the cache area
Fig. 7. Throughput at the cache size limits and memory area

In the first data size area, where the data fits in the cache, the observed hardware-dependent anomaly is caused by better cache space utilization, resulting in a higher cache hit ratio and a lower overall memory access time.
In the second data size area, where the data size starts exceeding the cache size (the upper three curves in fig. 7), the throughput gets lower with the data size growth. This behavior is natural, since with the data growth the amount of data that must be accessed from the memory rises, resulting in a growing number of cache misses. The upper three curves in figure 7 present the results of a single test run. In this area the software environment behavior can change the program's cache utilization efficiency, resulting in the execution-environment-dependent anomalies. When the results are averaged over a series of experiments, these anomalies are indistinguishable.

The third data size area, representing data sizes much larger than the cache size, shows much more consistent throughput behavior (the lower six curves in fig. 7). When the detailed behavior is analyzed (fig. 8), it can be seen that, as in the cache area, the data size growth results in throughput growth. Although the source of this effect is the same as in the cache area, its significance is much smaller here, because for larger data sizes the influence of the cache utilization is also much smaller. In the memory utilization area, as shown in the analysis presented in section 4, the larger the number of processors, the smaller the sum of the communication and modification time. This algorithm-dependent anomaly constitutes the source of the initial throughput growth when a larger number of processors is utilized.

Fig. 8. Throughput in the memory area (details)
6 Conclusions and Future Work
In the paper three types of execution anomalies, hardware, algorithm and execution environment dependent, were identified in the parallel SOM algorithm. Utilizing the scaled problem size technique, it was possible to describe the dependence between the input data size and the unexpected algorithm behavior. This technique also proved to be a valuable tool that can be utilized to select the correct execution parameters to avoid the hardware-dependent anomalies. The limitations of the paper size permitted a discussion based only on a single algorithm. In future works, the utilization of the scaled problem size technique for the analysis of a broader class of algorithms, specifically ones with a nonlinear complexity, together with the application of this technique to granularity-based evaluation [6] and performance prediction methods, will be presented.

Acknowledgements. This research was partially supported by the European Community Framework Programme 6 project DeDiSys, contract No. 004152.
References
1. Foster I.: Designing and Building Parallel Programs, Addison-Wesley Pub., 1995 (also available at http://www.mcs.anl.gov/dbpp/text/book.html).
2. Grama A., Gupta A., Kumar V.: Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures, IEEE Parallel & Distributed Technology, August 1993, pp. 12–21.
3. Kohonen T.: The Self-Organizing Map, Proceedings of the IEEE, 1985, vol. 73, pp. 1551–1558.
4. Kwiatkowski J., Pawlik M., Konieczny D., Markowska-Kaczmar U.: Performance Evaluation of Different Kohonen Network Parallelization Techniques, to be published in Proceedings of the Parelec'06 Intl. Conf.
5. Kwiatkowski J., Pawlik M., Wyrzykowski R., Karczewski K.: Cumulus – Dynamic Cluster Available under CLUSTERIX, Proc. of the Cracow Grid Workshop 2005, Cracow 2006, pp. 82–87.
6. Kwiatkowski J.: Evaluation of Parallel Programs by Measurement of its Granularity, Proceedings of the PPAM'01 International Conference, Naleczow, Poland, September 9–12, 2001, LNCS, Springer-Verlag Berlin Heidelberg 2002, pp. 145–153.
7. Lawrence R., Almasi G.S., Rushmeier H.E.: A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems, Data Mining and Knowledge Discovery, 1999, vol. III, pp. 171–195.
8. Sahni S., Thanvantri V.: Performance Metrics: Keeping the Focus on Runtime, IEEE Parallel & Distributed Technology, Spring 1996, pp. 43–56.