The second frequently used metric is speedup, which captures the relative benefit of solving a given problem using a parallel system. There are different speedup definitions. In general, speedup is defined as the ratio of the time needed to solve the problem on a single processor to the time required to solve the same problem on a parallel system with p processors. Depending on the way in which the sequential time is measured, we can distinguish absolute, real and relative speedups. In this paper the relative speedup is used, with the sequential time defined as the time of executing the parallel program on one of the processors of the parallel computer. Theoretically, speedup cannot exceed the number of processors used during program execution; however, different speedup anomalies can be observed. In theory, a speedup anomaly occurs when the program does not execute in the way predicted by the performance model. Specifically, the speedup can be larger than the number of available processors, the so-called superlinear speedup. There are three different reasons for this phenomenon:

– algorithm-dependent anomaly, caused by the internal algorithm structure, in other words the way the problem is parallelized; for example, in some parallel search algorithms, if a search tree contains solutions at different depths, then after distribution of this tree among different processors the solution can be found by exploring a smaller number of nodes,

– hardware-dependent anomaly, caused by specific hardware features, for example the interconnection network, the size of the cache, internal memory, etc.; the "cache effect" is an example of this anomaly: when a program is executed on a large number of processors, the per-processor problem size is reduced and all needed data can be placed in the relatively faster cache memory, which reduces the execution time and can raise the efficiency even above 1,

– execution-environment-dependent anomaly, caused by the execution environment, mainly operating system features such as the scheduling system; for example, when processors with different performance are used during parallel program execution.

The performance metrics presented so far do not take into account the utilization of processors in the parallel system. The next metric, the efficiency of a parallel program, is defined as the ratio of speedup to the number of processors. In an ideal parallel system the efficiency is equal to one. In practice the efficiency of parallel systems is between zero and one; however, when execution anomalies occur, the efficiency can be greater than one. Another useful metric is the scalability of the parallel system [1], [2]. In general, it is a measure of a system's capability to increase speedup in proportion to the number of processors. There are two ways in which scalability analysis can be carried out: with fixed and with scaled problem size. The fixed problem size scalability analysis answers the question: what is the fastest I can solve problem A on computer X [1]. In this case different (mainly hardware-dependent) execution time anomalies can occur. In their presence the second approach, scaled problem size analysis, can be used. In this case it is checked whether the efficiency can be kept at the same level when the problem size and the number of processors are increased concurrently. One can say that a system is scalable when the efficiency stays the same for an increasing number of processors and an increasing problem size. In this paper the influence of these two approaches on existing speedup anomalies is shown. The last metric used in the paper is the computational throughput, defined as the amount of data processed by a single processor in a unit of time. During the throughput analysis, only the part of the data that is distributed between the processors, whose size directly depends on their number, can be taken into consideration, provided the remaining data represents a constant time factor.
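As a minimal illustration of these definitions (a sketch, not code from the paper; the run times used are hypothetical placeholders), the three metrics can be computed directly from measured timings:

```python
# A minimal sketch of the metrics defined above: relative speedup,
# efficiency and computational throughput.  The timings in the example
# are hypothetical placeholders, not values measured in the paper.

def speedup(t_seq: float, t_par: float) -> float:
    # Relative speedup: time of the parallel program on one processor
    # divided by its time on p processors.
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    # Speedup divided by the number of processors; values above 1 signal
    # an anomaly such as the superlinear speedup discussed above.
    return speedup(t_seq, t_par) / p

def throughput(data_bytes: float, t_par: float, p: int) -> float:
    # Amount of data processed by a single processor in a unit of time.
    return (data_bytes / p) / t_par

if __name__ == "__main__":
    t1, t8 = 120.0, 13.5      # hypothetical run times for p = 1 and p = 8
    print(f"speedup    = {speedup(t1, t8):.2f}")        # 8.89 > 8
    print(f"efficiency = {efficiency(t1, t8, 8):.2f}")  # > 1: superlinear
```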
3 Parallelization of the SOM Algorithm
The Kohonen Self-Organizing Maps (SOM) [3] are commonly employed to process large input data, but their effective working abilities can be achieved only after a time-consuming process of learning. The hardware requirements for using even moderate-size networks can easily exceed available resources. Parallel algorithms offer a solution to this problem. They divide the computations between many independently working small computers or utilize the whole computational power of larger, multiprocessor machines. While offering the possibility to reduce the time needed to create the resulting network, to work on bigger data samples, or to prepare more detailed results, they preserve important properties of sequential SOM implementations.
In the original sequential algorithm the winner search and the modification of the winner's neighbours' weights are performed on-line in every learning step. The learning parameters are updated only once after each presentation of all the learning set vectors (at every epoch). The parallel algorithm variants are generally based on division of either the learning set (learning set parallelization) or the network (network parallelization) between the processors [7]. The speedup of parallel implementations utilizing the on-line algorithm is limited by the frequent communication. To lower the communication overhead, the off-line algorithm (with the weight updates performed once at the end of each epoch) can be employed.
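The difference between the two update schemes can be sketched as follows; this is an illustrative simplification, not the paper's implementation, and the neighbourhood helper `neigh` is a hypothetical placeholder:

```python
import numpy as np

# Illustrative sketch of the on-line versus off-line update schemes.
# `weights` has shape (n, d), `data` has shape (N, d); `neigh(w, e)` is a
# hypothetical helper returning the indices and neighbourhood factors of
# the winner's surroundings in epoch e.

def online_epoch(weights, data, lr, neigh, e):
    # On-line: weights change immediately after every winner search, so a
    # parallel implementation must communicate at every learning step.
    for x in data:
        w = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # winner
        idx, h = neigh(w, e)
        weights[idx] += lr * h[:, None] * (x - weights[idx])

def offline_epoch(weights, data, lr, neigh, e):
    # Off-line: only the winners are recorded during the epoch; all weight
    # modifications are applied once at its end, leaving a single
    # communication phase per epoch in the parallel variant.
    winners = [int(np.argmin(np.linalg.norm(weights - x, axis=1)))
               for x in data]
    for x, w in zip(data, winners):
        idx, h = neigh(w, e)
        weights[idx] += lr * h[:, None] * (x - weights[idx])
```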
4 Theoretical Performance Model
To analyze the performance of the algorithm, its performance model should be constructed. The paragraphs below present the performance model of the single off-line algorithm chosen for further evaluation: the network off-line parallelization [4].

In this algorithm the winner search procedure is divided between p processors by instructing each of them to find local winners in a p times smaller part of the network (the mini-network), approximately reducing the search time by a factor of p. The winner position is determined and remembered in every step, and the information transfer followed by the neighbours' weight modification is performed at the end of every epoch.
The execution time of the parallel algorithm depends on the communication time and the time of computations performed on each of the utilized processors. In the network off-line parallelization algorithm, the computation time on a single processor consists of searching for the winning neuron and modifying the weights of the winner and its surroundings. The time of the winner search operations depends linearly on the number of neurons. The number of modifications depends on the epoch index. The radius of the surroundings changes in accordance with the formula r(e) = S_a/(S_b + e), where S_a, S_b are a priori selected constants used to control the algorithm behavior and e is the index of the current epoch. It can be assumed that for a typical algorithm execution every neuron has an equal probability of being the winner. The average number of neurons in the winner's surroundings s(e, n) depends on the radius of the surroundings r(e) and the network size n. The precise number of neurons in the winner's surroundings depends on the position of the winner in the network and on the network size, but under the assumption of an equal probability of being the winner among the neurons, s(e, n) has an upper bound of (2r(e) + 1)^2. For a number of learning set elements equal to N, it can be proved that the number of neurons modified during E epochs is described by the equation

S(e, n) = N \sum_{e=0}^{E-1} s(e, n).

The shape of the S(e, n) function and the experimental results for the execution parameters utilized for the experimental evaluation (S_a = 16, S_b = 2, E = 16, N = 1152) are presented in fig. 1.
Fig. 1. Number of modifications in network off-line parallelization
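The model can be evaluated numerically. A sketch (illustrative, not the paper's code) using the upper bound (2r(e) + 1)^2 in place of s(e, n) and the parameter values quoted above:

```python
# Sketch of the modification-count model with the parameters used for the
# experimental evaluation: Sa = 16, Sb = 2, E = 16, N = 1152.  The upper
# bound (2r(e) + 1)^2 stands in for s(e, n), so the counts are estimates.

Sa, Sb, E, N = 16, 2, 16, 1152

def r(e):
    # Radius of the winner's surroundings in epoch e.
    return Sa / (Sb + e)

def s_bound(e):
    # Upper bound on the number of neurons in the winner's surroundings.
    return (2 * r(e) + 1) ** 2

def S(E, N):
    # Number of neurons modified during E epochs: N * sum_{e=0}^{E-1} s(e, n).
    return N * sum(s_bound(e) for e in range(E))

if __name__ == "__main__":
    for e in range(E):
        print(f"epoch {e:2d}: r(e) = {r(e):5.2f}, s(e) <= {s_bound(e):6.1f}")
    print(f"S(E, n) <= {S(E, N):,.0f}")
```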
The runtime of the sequential SOM algorithm T_seq depends on the number of neurons n, the time of the elementary search operation τ_c, the number of learning set elements N and the number of epochs E. If the time of modifying a single neuron's weights is τ_sm and the number of neurons in the winner's surroundings is s(e), the time t_mod during which the weights of the winner and its surrounding neurons are modified is given by the equation

t_mod = τ_sm \sum_{e=0}^{E} s(e).

The resulting equation describing the execution time reads T_seq = n τ_c N E + t_mod N.
In the parallel version of the algorithm, the network off-line parallelization, the communication phase is present after each epoch. Before starting communication, each processor computes and stores the vector of winners found for each of the learning set elements. In the communication phase this vector is distributed between the processors using a reduction operation, resulting in the determination of one global winner for every learning set element. After this phase each processor modifies the weights of the winner and its surroundings if they are in its mini-network. The parallel runtime T_p for this algorithm is given by the equation

T_p = (E n τ_c N / p + E t_mod N / p) + E t_r(N, p),

where t_r(N, p) is the time of the communication with reduction for N winners and p processors. The shape of the curve representing the algorithm's parallel runtime is presented in fig. 2.
Fig. 2. Network off-line algorithm execution time
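The reduction step of the communication phase can be sketched as follows. The paper does not name a communication library, so mpi4py and an elementwise MPI.MIN reduction are assumptions here, and the local winner distances are random placeholders:

```python
from mpi4py import MPI   # assumed library; the paper does not specify one
import numpy as np

# Sketch of the per-epoch communication phase: every processor holds, for
# each learning set element, the best (minimum) distance found in its own
# mini-network; an elementwise MIN reduction yields the global winner.

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

N = 1152                          # learning set size used in the paper
local_best = np.random.rand(N)    # placeholder local-winner distances

global_best = np.empty_like(local_best)
comm.Allreduce(local_best, global_best, op=MPI.MIN)

# A processor owns the global winner for element i iff its local distance
# equals the global minimum; only then does it modify the winner and its
# surroundings inside its mini-network (ties would need an extra rule).
owns = local_best == global_best
```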
The number of operations needed for the winner search is constant with respect to the number of processors. With the growth of the number of processors, the number of winners in the mini-network of a single processor decreases. When the communication overhead is small (i.e. for a small number of processors), this decrease results in a decrease of the overall execution time. After the execution time reaches its minimum, the rising communication overhead starts to result in execution time growth.
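This behavior can be reproduced from the runtime model of section 4. In the sketch below the constants τ_c and τ_sm and the reduction-time model t_r are illustrative assumptions, not measured values, so only the qualitative shape (a minimum followed by growth) is meaningful:

```python
# Sketch evaluating T_seq and T_p from section 4 over a range of processor
# counts.  tau_c, tau_sm and the reduction-cost model t_r are assumed,
# illustrative values, not constants measured in the paper.

Sa, Sb, E, N, n = 16, 2, 16, 1152, 48 * 48

tau_c, tau_sm = 2e-8, 2e-7        # assumed per-operation times [s]

def t_r(N, p):
    # Assumed reduction time, growing linearly with the processor count.
    return 5e-7 * N * (p - 1)

def s(e):
    return (2 * (Sa / (Sb + e)) + 1) ** 2

t_mod = tau_sm * sum(s(e) for e in range(E + 1))   # sum_{e=0}^{E} s(e)

T_seq = n * tau_c * N * E + t_mod * N

def T_p(p):
    return (E * n * tau_c * N / p + E * t_mod * N / p) + E * t_r(N, p)

if __name__ == "__main__":
    print(f"T_seq = {T_seq:.3f} s")
    base = T_p(1)                 # relative-speedup baseline (one processor)
    for p in (1, 2, 4, 8, 16, 32):
        print(f"p = {p:2d}: T_p = {T_p(p):7.3f} s, "
              f"relative speedup = {base / T_p(p):5.2f}")
```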
5 Experimental Evaluation of the Parallel SOM Algorithm
To confirm the correctness of the theoretical analysis presented in section 4, a series of experiments was performed. The tests were executed on the Cumulus cluster [5] consisting of 36 homogeneous space-shared IBM PC nodes (Sempron 1.7 GHz, 512 MB RAM). The results presented in the paper are based on the values from series of at least five test runs. If not stated differently, the averaged values are presented.

In the experiments based on the fixed problem size technique, for large and moderate SOM network sizes (48x48 and above) the characteristics of the application efficiency closely matched the theoretical predictions. For smaller network sizes superlinear speedup appeared (fig. 3). The parallel execution efficiency largely exceeding 1 suggests that for numbers of nodes equal to and larger than 8, the utilization of the processor cache speeds up the memory access operations. In the tested algorithm, the network off-line parallelization, with the growing number of utilized processors the network size allocated to one processor gets proportionally smaller. In the algorithm implementation a single neuron is represented by a vector of 617 8-byte long fields. The utilized CPUs have 128 kB L1 and 256 kB L2 caches. An analysis of the network size values gathered in table 1 shows that for the 24x24 network the allocated memory gets close to the cache size when the number of processors reaches 8. For the 48x48 network this happens for 32 processors (hence the final efficiency rise). When 96x96 and larger networks are used, even for the largest tested number of processors the network size is too big to cause cache-related efficiency growth.
Fig. 3. Efficiency of the parallel SOM algorithm

Table 1. Network size in kB per processor

No. of proc.        1      8     32
Net. size 48x48  11106   1388    347
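The per-processor footprint behind table 1 can be recomputed from the neuron representation given above; a sketch (the cache comparison threshold is an illustrative choice):

```python
# Sketch recomputing the per-processor network size of table 1 from the
# neuron representation stated above: 617 fields of 8 bytes per neuron.
# The combined L1 + L2 cache of the utilized CPUs is 128 kB + 256 kB.

BYTES_PER_NEURON = 617 * 8        # 4936 bytes
CACHE_KB = 128 + 256              # L1 + L2 per node

def net_kb_per_proc(side: int, p: int) -> float:
    # Memory occupied by the mini-network allocated to one of p processors.
    return side * side * BYTES_PER_NEURON / p / 1024

if __name__ == "__main__":
    for side in (24, 48, 96):
        for p in (1, 8, 32):
            kb = net_kb_per_proc(side, p)
            note = "  <- at or below the cache size" if kb <= 1.1 * CACHE_KB else ""
            print(f"{side}x{side} network, p = {p:2d}: {kb:8.0f} kB{note}")
```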
Fig. 4. Execution efficiency in relation to the number of processors

Fig. 5. Execution efficiency in relation to the data size allocated on a single processor

Fig. 6. Throughput in the cache area
Fig. 7. Throughput at the cache size limits and memory area

In the first data size area, where the data fits in the cache, the observed hardware-dependent anomaly is caused by better cache space utilization, resulting in a higher cache hit ratio and a lower overall memory access time.
In the second data size area, where the data size starts exceeding the cache size (the upper three curves in fig. 7), the throughput gets lower with the data size growth. This behavior is natural, since with the data growth the amount of data that must be accessed from the memory rises, resulting in a growing number of cache misses. The upper three curves in figure 7 present the results of a single test run. In this area the software environment behavior can change the program's cache utilization efficiency, resulting in the execution-environment-dependent anomalies. When the results are averaged over a series of experiments, these anomalies are indistinguishable.

The third data size area, representing data sizes much larger than the cache size, shows much more consistent throughput behavior (the lower six curves in fig. 7). When the detailed behavior is analyzed (fig. 8), it can be seen that, as in the cache area, the data size growth results in throughput growth. Although the source of this effect is the same as in the cache area, its significance is much smaller here, because for larger data sizes the influence of the cache utilization is also much smaller. In the memory utilization area, as shown in the analysis presented in section 4, the larger the number of processors, the smaller the sum of the communication and modification time. This algorithm-dependent anomaly constitutes the source of the initial throughput growth when a larger number of processors is utilized.

Fig. 8. Throughput in the memory area (details)
6 Conclusions and Future Work
In the paper three types of execution anomalies, hardware, algorithm and execution environment dependent, were identified in the parallel SOM algorithm. Utilizing the scaled problem size technique, it was possible to describe the dependence between the input data size and the unexpected algorithm behavior. This technique also proved to be a valuable tool that can be utilized to select the correct execution parameters to avoid the hardware-dependent anomalies. The limitations of the paper size permitted a discussion based only on a single algorithm. In future works, the utilization of the scaled problem size technique for the analysis of a broader class of algorithms, specifically ones with a nonlinear complexity, together with the application of this technique to granularity-based evaluation [6] and performance prediction methods, will be presented.

Acknowledgements. This research was partially supported by the European Community Framework Programme 6 project DeDiSys, contract No. 004152.
References
1. Foster I.: Designing and Building Parallel Programs, Addison-Wesley Pub., 1995 (also available at http://www.mcs.anl.gov/dbpp/text/book.html).
2. Grama A., Gupta A., Kumar V.: Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures, IEEE Parallel & Distributed Technology, August 1993, pp. 12–21.
3. Kohonen T.: The Self-Organizing Map, Proceedings of the IEEE, 1985, vol. 73, pp. 1551–1558.
4. Kwiatkowski J., Pawlik M., Konieczny D., Markowska-Kaczmar U.: Performance Evaluation of Different Kohonen Network Parallelization Techniques, to be published in Proceedings of the Parelec'06 Intl. Conf.
5. Kwiatkowski J., Pawlik M., Wyrzykowski R., Karczewski K.: Cumulus – Dynamic Cluster Available under CLUSTERIX, Proc. of the Cracow Grid Workshop 2005, Cracow 2006, pp. 82–87.
6. Kwiatkowski J.: Evaluation of Parallel Programs by Measurement of its Granularity, Proceedings of the PPAM'01 International Conference, Naleczow, Poland, September 9–12, 2001, LNCS, Springer-Verlag Berlin Heidelberg 2002, pp. 145–153.
7. Lawrence R., Almasi G.S., Rushmeier H.E.: A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems, Data Mining and Knowledge Discovery, 1999, vol. III, pp. 171–195.
8. Sahni S., Thanvantri V.: Performance Metrics: Keeping the Focus on Runtime, IEEE Parallel & Distributed Technology, Spring 1996, pp. 43–56.