all right, i see we have a full crowd today. let's get started. good afternoon, everyone. my name's eddie fong. i'm a group program manager in the exchange online team. i'm incredibly excited to talk to you about how we run our services online. i will also be inviting a few of my co-workers to join us on stage to tell you our story.

first of all, to be super clear, we're a bunch of engineers, engineers that make mistakes, learn as much as we can from them, and you'll hear and see how we deal with sort of the chaos that sometimes comes from running a global service. in fact, as you will notice, our data centers do not look like this, even though marketing may want you to believe so. what we will show you, though, are real pictures and things such as us loading servers into our data center. and occasionally, when we make mistakes, pictures like this.

we're going to show you stuff, data, metrics that are unique to us running the exchange service. in fact, we'll talk about stories that are very authentic and transparent. but as the slide says, we are trying to be careful about what we will show, and in some cases we're gonna take out data, facts and other numbers that, in some cases, are going to be wrong by the time i finish speaking. that is simply a fact of how fast the service is growing.

let's talk a little bit about the scale that we're operating at, and this is especially important for those of you who may not have seen this talk before; we'll go into the details a little bit. we have essentially one of the world's largest networks. it is amazing to think that we started from just one data center a few years ago, and we're at over 100 data centers at this point and growing. in fact, we are accelerating the number of places in the world where we are putting up data centers.

why do we do that? well, we're doing it because we wanna provide the best possible experience for our customers, especially when you think about performance and latencies. and in some cases, specific rules and regulations for a country dictate that we are there in order to respect their rules and regulations. and it is amazing because you think about how big the world is and the amount of network gear and

underlying infrastructure that is required to build this, which we will go into a little bit shortly. this chart essentially shows the growth rate that we've seen since starting the service. as you can see, it is a very impressive exponential growth kind of a chart. in fact, when we showed this last year, we were saying, you know what, we're gonna repeat it again this year coming around. but at some point, you gotta figure out, are we going to slow down?

in fact, this year i am sad to report that we finally slowed down... of course not. we continue to grow at this crazy clip. i wanted to give you some perspective on what goes into the server growth that we experience. remember that picture i showed you earlier of the people loading servers into the data center? well, there are three things to think about. first, servers get delivered and

we have people in data centers that rack up the machines. then they hand it off to another group of people who ultimately put on additional software and do further hardware configuration, before it's finally handed off to us, the engineering teams, in terms of what we call r tag, which we take to production. well, the reason i tell you this is, you notice in the middle here, this sort of what looks like a slowdown? now, to be clear, the number of servers we have isn't an exact approximation of the crazy growth we're doing.

but it is a proxy. what really happened behind the scenes was that we had a lot of servers that landed in our data centers, but there was a bit of a logjam in terms of them pushing onwards. think of an assembly line where things kind of slow down a little bit. well, as you can imagine, at the volume of servers that we're dealing with, the moment they kind of figured it out and fixed it, a huge wave of servers was unleashed onto the engineering teams.

we were literally hitting new records for server throughput capacity, every month saying, wow, we broke yet another record for deploying servers to production. and yet, at the same time, because we had slowed down a little bit in terms of getting to production, we had a lot of pressure coming from our senior leadership team to go faster, and faster, and faster. as scotty would say, i'm giving her all she's got, captain. it was an incredible experience when we continued to do so. this is an interesting slide i thought i would share with you for

the first time. it is a fairly significant milestone for us, over 100 billion messages delivered in a single month. in fact, the latest numbers are even higher as we speak. just as a way of comparison, i did a quick search. turns out, that's about the entire year's worth of mail delivered by the united states post office. so email has continued to grow at an insane clip, at least from our service perspective.

can i get a show of hands of folks in this room who are still thinking about moving to the cloud, or are you all already committed to doing so? a fairly low number, that's about right. well, for those of you who are considering moving to the cloud, you will no doubt be thinking, hey, there are three things i need to think about. what's the cost? what's the risk? do i really trust this cloud thing?

and the value you're getting. is this what i really want? is this how it's gonna improve my company's productivity? well, here's what we really do behind the scenes. on cost, there are two things to think about. one is, we work really closely with our azure friends to essentially consolidate what we call hardware skus. when you're able to do that, we essentially become one of the world's largest buyers of hardware.

and when that happens, you get to go negotiate as low a price as possible with vendors, and ultimately we pass those savings on to you guys. the second factor is this constant focus on what we call cogs reduction. after we buy the hardware, we have a lot of folks who spend an amazing amount of time focusing on how we squeeze more and more performance out of the same existing hardware, in order to accommodate all these new features that we're building day in and day out.

and of course, our goal is to continue to push innovation while respecting all these conditions in terms of the environment we're in. we believe that by pushing the envelope for these three conditions, we'll be able to offer one of the most compelling services on the planet, given the scale that we are running at. one of the first key learnings for us in running such a large service is that everything fails at scale. in fact, the focus has shifted from predominantly looking at root cause and understanding what's really going on.

to be clear, we still do that. but when you run a service of this magnitude, the focus shifts to recovery. every minute that we're down means somebody somewhere is not getting their email or not able to access the service. this is incredibly important now that we realize we have banks, major parts of governments, and even parts of countries relying on us to be up and running. we will talk about some of the subsystems that we've built,

including some of the service philosophies that drive and permeate the entire engineering organization. and even more interesting is that we have to do this with the highest levels of security and compliance, in order to ensure we meet the requirements of our customers. the thing i do wanna point out, though, is that even though you have all these subsystems, the most important thing to think about is ensuring that you are driving a very tight loop in between them. this is key.

we've learned that if you let things sort of go stale for more than a week, which we'll, in fact, tell you a little bit about later on, things break. we had funny issues, and it's a key learning that we've also had to go apply. one of the key design principles for us is the fact that we keep four copies of your data in multiple places around the world. why is that important? well, it turns out this is how we go about protecting your data and

ensuring a certain level of uptime. by having multiple copies, it lets us guarantee that we can deal with hardware failures, network failures, top-of-rack issues, and the occasional data center outage. in fact, in order for us to handle losing an entire data center in any region around the world, we actually reserve enough capacity, distributed across the entire service, to enable us to run in this state. one of the most amazing experiences i had when i first joined the team not too long ago was that every one of us gets to be an incident manager,

where you're responsible for the service. and my first time involved one incident where, with a single click of a button, our engineering team said, you know what? this particular part of the us is not doing well, we need to fail out. and, literally, within 20 minutes, without anybody really noticing around the world, we made the call, and suddenly we were running live, failed over to another part of the united states. and that was mind boggling from my perspective,

how truly amazing our capabilities are in doing so. well, as it turned out, this is incredibly important. this is a picture taken from inside one of our data centers. i'll tell you another story. as you know, like every competitor, we run the leanest, densest, greenest data centers that we can. and what this usually means is that we'll compromise on things like air conditioning, cuz you know, it consumes the most electricity, and therefore cost, in running data centers.

and in fact, we have dialed the cooling way back in some of these data centers. well, it turned out last year, the united states had its hottest summer on record, and coupled with the usual humidity, that resulted in an interesting phenomenon in some of our data centers, especially in the midwest. as we started the summer, we saw server after server start to clock down. essentially, the servers were hitting what i would call their maximum operating temperatures, and the only rational way for

the service to keep going was to clock down. well, that's a huge problem for us, because we were assuming that the servers run at maximum capacity; by clocking down, we essentially lost half our available capacity in that data center. of course, us being engineers, we thought, hey, why not just crank up the fans? that way we don't clock down. and of course, our friends in the data center were saying no, no, no, no, don't do that.

that's gonna have a bad effect. and we're like, that's all right, we're smart, we'll figure that out. and of course, a few weeks later they sent us this picture, and essentially what had happened was, the airflow that we had generated was so strong and so hot that it caused the walls in the room with the servers to become separated from the ceiling. needless to say, we quickly undid that change, and we no longer ignored our data center friends. after we barely survived the summer in this sort of failed state, where

we failed out of the data centers, we began working on a solution. what do we do? we cannot have this repeat itself next summer. of course, we were also assuming that during the winter months, we'd at least be able to find some data centers somewhere, maybe in the southern hemisphere, where we could test this out in production. of course, as luck would have it, we had a very mild winter, and therefore we didn't have temperatures hot enough, essentially, to simulate the conditions.

so what do we do? as engineers, we got creative with our friends in the data centers, and ultimately what we did was, we taped cardboard to each of our server racks as a way to simulate the failure conditions. this worked perfectly, because we had yet another record-breaking summer heatwave this year, and we were able to run at our full capacity without needing to fail out. let's zoom in a little bit into what i call our service fabric. what you'll notice is that it's made up of many

loosely connected systems. what that allows us to do is that each of the systems can then independently scale on its own. this level of flexibility lets us focus on a smaller set of components while being able to rely on the other parts being stable. in fact, some of our demos will go into a little bit of how some of these systems work, to give you a flavor of what it feels like. on that note, i would like to invite my friend karim batthish, who is one of the long-running exchange engineers,

onto the stage to give us a demo of what he does in terms of troubleshooting large-scale problems. welcome, karim. >> thanks, eddie. i'm gonna show this is a real demo by not being logged into my laptop before i start. hi, could i see a show of hands quickly, how many of you are exchange server administrators, or have been exchange server administrators at one time?

awesome! and how many of you have managed mailbox servers, and done things like move servers for upgrades? a good number of you. i'm a group program manager, and i'm responsible for the team that owns storage in exchange online as well as in the exchange server product. and in many ways, i'm responsible for the administration of exchange online's storage system, much like, i guess, many of you are. now, as we run exchange online,

one of the differences that might exist between on-prem and the cloud is really the scale at which we operate. as of this week, i think we crossed somewhere around 400 petabytes of user data stored in mailboxes, across about 100,000 mailbox servers. so, when we have to troubleshoot and figure out when things are going wrong, we've got some tools that we needed to build to sort of keep ahead of that, and, as eddie mentioned earlier, be able to quickly troubleshoot and

diagnose the problems. one of the tools behind me is the office service pulse, or what we call osp. now, the reason i brought up migration and moving mailboxes is because we do that in the cloud quite a lot. there's a couple of reasons that we move mailboxes in the cloud, and they're pretty much the same as those for on-prem. the first is when you add new capacity. as you could see from eddie's slide earlier,

we're adding servers like it's going out of style. we're putting data centers into new countries and creating whole new regions, and as a result we're moving user mailboxes in there. and we've invented a bunch of really interesting technology in the service that does automatic load balancing based on space, based on balancing cpu across servers depending on peak times or where there are troughs in usage, as well as making sure that there aren't too many io-intensive users

in one place or another, or even just network load and network saturation. so we're always keeping tabs on and tracking that sort of thing. the second reason that we move mailboxes around is when we do something fundamental to exchange that changes the way we wanna store the data. examples of that might be that we've found some amazing new way to reduce the io footprint of exchange, and need to change the way we write user data to disk or to the database.

another might be, we change our search tech. a last one might be, we've added some security features, or security functionality that's layered in. and as a result, just like we'd have to do on-prem, we in the cloud will move user mailboxes around. and at a scale of 400 petabytes and growing, we really do have to stay on top of this. there are usually over a dozen of these projects going on at any given time.

so we're continually shuffling bits around. now, one of the things that happens with these projects is people will call me. they'll call and complain and say, karim, my project's going slow. why haven't you moved my mailboxes? and what we have to do is go and troubleshoot. and this is the blue badge peek that was mentioned in the description of the session. i'm gonna show you things that our lawyers would never

actually agree to, that hopefully you'll enjoy seeing. and it's exchange service scale debugging and troubleshooting. the first thing that happens when someone asks me, why is my stuff going so slow, is i take a look at this. what you see here is the availability view of just one partition here. one forest, namprod00, that's our service dogfood partition where we run exchange bits. whoops.

i'll just keep it on that day. we run the exchange service bits there before they roll out to broader production in the rest of the world. that's the place we turn things on, try things out, before anything happens more broadly, and it is why you see a slightly bumpy bit of availability. usually, worldwide, it's a pretty smooth line unless something is going really wrong. there's a couple of things that you'll see there.

the red alert line, eddie will talk about later when he describes red alerts, but it's a new amalgam of the availability across protocols. it lets us find problems even before individual protocols would fail. and you'll see there is outlook web app, activesync, and the outlook client through the server. and that is grouped monitoring of availability across each one of those protocols, both synthetic probes that let us see, are these endpoints working?

are basic transactions happening? and passive monitoring, where we look at user latency and user errors, and roll those up so we can actually tell if users are having problems that our synthetic or artificial probes might not see. now, in this case, if i'm trying to figure out why things are going slowly, this view doesn't really help me. there aren't any major incidents, and these projects that we do, these big moves, these big load balancing projects, we run them in a discretionary sense.

they don't take priority over service health, and they don't take priority over user load, which would be represented here. and they don't take priority over user onboarding projects. so, if any of those things were going on, i should be able to see it and give a quick answer. so, availability's fine. what i can then do is go take a look at our balancing heat map. now, this view gives me kind of a bird's eye view of the entire exchange online service.

i can guess, based on the columns, how many databases there are, relatively speaking, across the service, and how full any of them are. there's a couple of spots that are red, but in general, this view looks pretty clean. we call it the perry scheme, after perry clark, our fearless leader, because green is good; green is fairly full, but as far as we're concerned, green is good. so, we look at that chart easily every day to say, are things going well?

is the balancing service doing its job? and, again, the namprod00 partition is purple. purple is really good. so, there really aren't any big imbalances there. now, when things start getting interesting is when i start to look at throughput. and what we do, again, is across our service, we have to look at aggregate data and aggregate throughput, which is really the most interesting thing to figure out why hundreds of

terabytes of data are moving slower than expected. and there's a couple of things there that are interesting. now, some of this is simulated and not an exact view, but you'll see that, on the top middle, the number of servers, or dags, capacity units, that have active requests has actually been dropping off for a few days, which is problematic and indicative that there's something lower-level that's an issue. the next thing i would check is to say, well, how much data has been moving in and

out of the service, and as i look, there's a bunch of data moving in. i'll give you an idea of what a data point looks like here. eventually... there you go, so you can see, there are actually about ten projects that are trapped in this partition, and this is an idea of how many gigabytes have actually been flowing through each one of these projects that's being tracked by the orchestration of our system. and we can see that load balancing has moved... a few days ago it was moving about 700 gigs in a day, and as we get over here to the 25th, actually very few.

and i'll say, this isn't a simulated problem. this is a real issue that we have been debugging over the last few days. the last thing i'll check, before i try to dig into the service, is the set of failures in this particular forest. and again, what we do is collate all of the mrs move reports, the same ones that you can look at on-prem, and categorize the content within them to make sure that we know what's going on and how the service might be reacting and failing.

and again, if i'm trying to figure out why something is going slow, i look at fatal failures. 15 failures, not too many. but it looks like we're seeing thousands of transient failures in the service, which tells me, in that service dogfood environment, that something is up. now, this is the part that's kinda fun. in the osp view, one of the common things we do as engineers is drill down to look at capacity.

we take a look at the set of servers, the set of dags; each one of our dags is 16 nodes. we load thousands of mailboxes, as you can see here, on each one of the dags, and basically track the cpu, io, replication, network, memory, free space. this is really a snapshot into what you might see in perfmon, but at scale; running perfmon against 100,000 servers really doesn't work great.

so we've got a set of tools that plug into a massive archive of perfmon data that's continually running against the service, so we can see a live view of what's going on. and in our service dogfood environment, what we have done is set aside one of the capacity units, this one over here, namprod00dg5, and it really should be getting megabytes a second of replication traffic. that means, if we're pushing data in, as an engineer i would expect to see megs of data going in a second there, and really there aren't.

so what i can do at this point is drill into the dag and look at what's going on. now, one thing we do is we fail servers out, and fail servers in, to upgrade capacity. and some of the servers are actually out of rotation in this time interval as they're being upgraded. but the other servers should be able to handle the capacity just fine, and there are a couple of things that are highlighted here that are a little bit odd.

you can see that there are a couple of servers doing over a meg a second, but this server that's chewing up a lot of cpu is, in fact, not moving very much data, and that's really a concern. so i can track over time intervals; at the 85th percentile, this server is mostly busy, running at 80% cpu. and, as a concerned engineer, i'm gonna need to really go and dig in and take a look at that server and see what's going on. now, as eddie mentioned, i've been on the exchange team for a really long time, and i'm pretty old-fashioned in how i go debug.

a person working on my team, who doesn't necessarily need to connect to every server, might, instead of going and looking at that server, essentially aggregate a bunch of process load information. but i would rather look one server at a time. so we've built a tool called measure-performance, which lets us dig into each machine, or groups of machines, or in fact whole sets of them, to see what's going on. now, this tool is running live against the exchange online environment, so we cross our fingers.

you'll see a couple of things. you can see that it's real hardware. you can guess from the sku name. let's see if i can scroll back up there. you can guess from the sku name that you saw up above. one more second here. oops, it likes scrolling. but it actually is real hardware that you yourself can go and buy. sorry, i've lost my mouse.

let's do that one more time. we've got a very slow session here, so if you will forgive me, folks. why don't i flip to the picture to give you an idea, if i can. [laugh] well, what i'll do is i'll flip to the picture while that thing is refreshing. all right, here. so as

we render the picture very slowly. very, very slowly. what you will see is essentially a dump of all the processes running on the box. now, there's some other information as well that is interesting for the session, or to debug the server, where the actual problem is... forgive me for, again, getting stuck. let me just close this and show you from here. it should be a little bit faster.

all right, hooray. all right, so what you see here is that we can actually see the sku that the server is running on. we said we wouldn't name the manufacturers, right, but it's a dl380 gen 9 with 24 cores and 192 gb of ram. and you can see that sample again, where cpu is approximately 70% at the time of the snapshot. and you see as well a list of processes. now, if you've administered an exchange 2013 server or later, you'll see that these are the exact same process names

of services that run in the on-prem instance. in fact, you don't see any of our orchestration code really running there, except for perhaps the big data loader at the end, which is what helps us upload a bunch of the telemetry. there's a bunch of things that you can actually see here: high cpu, the number of databases that are mounted on the server, requests per second, the number of reads per second on the c drive, and then a stack-ranked list of processes by cpu. now, percent cpu, if you're familiar with looking at perfmon,

you get 100% per core. and the replication service, which is the same mrs migration service that runs on-prem, also responsible for load balancing, is actually taking up nearly seven cores of cpu, and there's something else that's moderately alarming here, which is that there's a large amount of that time being spent in contention. and as always with engineers, it's a funny story. as the migration team, we added logging to better understand the performance issues in the service.

and it is in fact that logging that caused contention in this case. the logging for performance is actually what caused the slowdown of migration. >> so normally an investigation like this, barring the very slow connection to the data center, generally would take a few minutes. and the dogfood environment lets us find these kinds of bugs before they hit scale, before they hit production and get to servers that would be yours, or hosting your data or your users.
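as an aside, the 100%-per-core convention is easy to trip over, so here is a minimal python sketch, with made-up numbers rather than readings from the live demo, of turning a raw perfmon-style counter into cores:

```python
# perfmon-style "% processor time" for a process is reported per logical core,
# so a box with 24 cores tops out at 2400%. numbers below are made up for
# illustration; they are not readings from the live demo.

CORES = 24              # e.g. a 24-core mailbox server like the one shown
raw_percent = 700.0     # hypothetical counter value for one process

cores_used = raw_percent / 100.0                # ~7 cores busy
share_of_box = raw_percent / (100.0 * CORES)    # ~0.29 of total cpu

print(f"{cores_used:.1f} cores, {share_of_box:.0%} of the box")
```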

with that, thank you, i'll head back to you, eddie. >> thank you, karim. >> [applause] >> hopefully you got a sense of kind of the world we live in, and of course the machines that don't behave. we know quite a lot about that. all right, let's spend some time talking about monitoring, which is a key area of how we run our services at scale. this is a slide that, if you were an engineer on the exchange team, you would be very familiar with.

we look at this probably every month, and it represents a really important metric. what it tracks is essentially the number of red alerts, which we'll talk about later, that are generated on a monthly basis. well, why is this important? every time we have a red alert, essentially an engineer gets paged, and so our goal, obviously, is to minimize the number of red alerts we get in the service, while ensuring we don't miss any incident. we also wanted to plot the growth of the service as we are running

our monitoring system. and as you can see, while the green line, which represents the server count, continues to grow, we are able to keep the number of red alerts somewhere between a thousand and two thousand. to be clear, this is an incredibly large and difficult problem that the monitoring team drives on a daily basis, in order to ensure we don't simply keep getting overwhelmed with alerts. monitoring is just a start. it is what you do with it that matters more than anything else.

we'll talk about three things in a little bit. the first is the signals: what do you do with them? what sense can you make of them? then the insight that hopefully you can derive from understanding the signals. and then the work you do once you have that insight, in order to help you run the service. it was built for massive scale, as you can see. it's probably even higher than half a billion transactions per hour at this point.

not only that, we are constantly adding and evolving capabilities, things like new signal ingestion pipes and new data streams. in fact, the technology we used to build this engine has evolved over time. we started originally with a pure microsoft set of technologies, but recently we started to incorporate other open source technologies like cassandra, which we announced last year, in order to deal with data storage and computation at that scale. we will also talk about one of the evolutions of the monitoring system

later on, in something we internally call the customer fabric. one of the key systems that we rely heavily on is this thing called outside-in monitoring. essentially, every five minutes, all over the world, we simulate a series of basic transactions that tells us if the service is running or not. this creates a baseline that we will then use later on, which i will explain. of course, you have to imagine that so much traffic assumes a reasonably healthy working network, which we saw earlier, since we are global.
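to make that concrete, here is a minimal sketch, in python, of what an outside-in probe loop could look like. the five-minute cadence follows the description above, but the urls, the timeout, and the record callback are hypothetical stand-ins, not our actual probe code.

```python
import time
import urllib.request

# hypothetical endpoints standing in for the basic transactions described above;
# the real probes hit protocol endpoints (owa, activesync, and so on) from many
# locations around the world.
ENDPOINTS = [
    "https://mail.example.com/owa/healthcheck.htm",
    "https://mail.example.com/ews/healthcheck.htm",
]

PROBE_INTERVAL_SECONDS = 5 * 60  # every five minutes, per the talk


def probe(url, timeout=10):
    """run one synthetic transaction and return (success, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.monotonic() - start


def run_probe_loop(record):
    """record(url, ok, latency) stands in for whatever stores the baseline."""
    while True:
        for url in ENDPOINTS:
            ok, latency = probe(url)
            record(url, ok, latency)
        time.sleep(PROBE_INTERVAL_SECONDS)
```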

this is a picture of our network cabling sort of hanging out. it turns out there was a massive landslide, due to a storm somewhere in the middle of the us, that essentially took out one of our primary network trunks. we became aware of this when suddenly all our red alerts started firing, because we had essentially saturated all our backup links, and it wasn't until then that we realized we had lost a major trunk. in fact, as you can see in this aerial photo, it took several days for us to even get to the location of the fiber cut in order to repair it.

here's another picture of the damage caused by mother nature. and finally, i'll leave you with this last picture of the things we have to deal with as we run a global service. karim mentioned the passive monitors. that is also yet another key signal we use in order to understand what's going on with the service. what we do is we go record a series of user-generated traffic and signals, after, of course, we have stripped out pii data.

but we look for things like user logins. where are they logging in from? are they getting errors? what kind of os are they running? what kind of browser are they running? what we do with this is that we join them, and we'll show you later on what we do with it. but the key is, this is all user-generated traffic. it's all after the fact, and

is an important facet of how we think about monitoring. you may be thinking, wow, you just generated millions and millions of outside-in monitors and passive monitors... what are you going to do with all that data? aren't you going to be overwhelmed? in fact, with most of what i would call threshold-based monitoring, yes, we would absolutely be overwhelmed, because everything assumes that every signal is of high fidelity and super accurate.

well, this is where the team had a very interesting idea. they said, hey, rather than relying on each signal being good or bad, what if you took all of them and evaluated them together? the idea is that if many signals say something is wrong, well, duh, something must actually be wrong, as opposed to one, or two, or a few of them, where it could literally just be noise. and of course, this also takes away the burden of each signal being super accurate, which is a common problem

which a lot of monitoring systems try to deal with. as it turns out, this approach fundamentally changed the game of monitoring for us in a few ways. the first is that it dramatically reduced, from millions to essentially thousands, the number of signals we have to go pay attention to. second, it no longer required us to have individual fidelity for each signal. and third, it dramatically increased the confidence with which we were able to see that something was wrong.
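here is a minimal sketch of that aggregation idea, with made-up signal counts and an arbitrary agreement threshold; the real evaluation is far more sophisticated, but the principle of requiring many noisy signals to agree is the same.

```python
# each signal is a noisy vote: did this probe or passive monitor look unhealthy?
# rather than alerting when any single signal crosses a threshold, we only alert
# when a large fraction of independent signals agree. thresholds here are
# arbitrary illustrations, not the production values.

def looks_red(signal_votes, agreement=0.6, min_signals=20):
    """signal_votes: list of booleans for one scope (say, a dag or a forest)
    in the current evaluation window; True means that signal looked unhealthy."""
    if len(signal_votes) < min_signals:
        return False  # too few signals to trust; could easily be noise
    return sum(signal_votes) / len(signal_votes) >= agreement


# a handful of noisy failures is ignored; broad agreement fires
print(looks_red([True] * 3 + [False] * 47))   # False
print(looks_red([True] * 40 + [False] * 10))  # True
```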

in fact, we call this a red alert. you will have seen this earlier, when i showed you that we generate somewhere between a thousand and two thousand red alerts. this is sort of the result of our monitoring system telling us something is wrong. well, what do you do with all these red alerts? well, this is what we do. we actually take them all, and then we plot them against time, and against partitions in the service.

in this case, you probably cannot read it, but each row represents a forest in the service. and when we see issues that arise from a single column, or a single day, what it usually tells us is that something at a lower level, such as networking or, in some cases, active directory, is down, and this lets us quickly work with our partner teams to get it working as soon as possible. and of course, when you see something that goes on just at a particular

forest, it means something is unique and isolated to that area, and again it lets us quickly understand what changes we made in that forest so we can quickly get it up and running. and of course, every so often we see this chart and we're still not exactly sure what it's trying to tell us. let's talk a little bit about customer fabric. remember, we started with this monitoring system, what we call the service fabric.

it was amazing in telling us everything we needed to know about machines, databases, forests, pretty much any physical object in the service; we had deep insight into it. but it was really hard to tell, hey, am i affected? can you tell me whether my customer, my tenant, is affected? no, it was very difficult for us to do so. with the customer fabric, what we did was leverage all the learnings we had in terms of building the service fabric, but apply them to a new dimension, something that we can look at on

a per-customer basis. and again, we were very sensitive about the data; we made sure all the pii data were stripped away. but in a nutshell, this lets us tell individual customers, at a tenant level, whether they were perhaps suffering from an outage even though our service was essentially saying everything was good. and on that note, i would like to invite paavany to the stage, who will give you a little bit of insight and a demonstration of the new customer fabric. [inaudible]

no? okay, there you go. hello everyone. my name is paavany, i am a program manager on the office 365 customer fabric team. i'd like to spend a couple of minutes walking you through two tools that we use internally. the first one i want to talk about is called the red alert drop zone, which is powered by the aforementioned service fabric.

the second tool is a customer-centric tool called linx that's powered by the customer fabric. then i'd like to conclude by touching upon what all this means for you, what's in it for you. so, let's get started. so what you see over here in front of you is the tool called the red alert drop zone. so you've heard a bit about our monitoring systems, red alerts, etc., etc.

so when these systems actually detect a problem with our service, we fire an alert. so when the engineer gets paged, the first thing he does is actually open up this tool. the goal of the red alert drop zone is to scan through all the data in the service fabric and bubble up the most relevant sets of insights that actually empower the engineer to go fix that problem quickly. so over here on the left you see the scope that is impacted by this specific alert; it talks about the region, the forest, and so on and so forth.

you also see the aggregate availability lines over time, and we also bubble up machine-level information. so it's not just what the machine name is; we tell you what the cpu is, what build it's running, how many user mailboxes are actually mounted on that database, and so on and so forth. again, all with the goal of getting the engineer up and running really, really quickly. so let's assume this engineer of ours goes and checks in that fix, and

it gets deployed, and the service is back up and running. so what happens then? our job is not done yet. what we then do is switch gears and use a tool called linx. now, just as an aside, a lynx is a type of wildcat, like a tiger, lion, etc., and mythology has attributed several supernatural qualities to this creature, including clairvoyance and deep insight. and so, we naturally had to name this tool linx. anyway, linx is basically an on-demand tool that does

continuous assessment of every single user in exchange. so what you see over here is a dashboard that was pulled up for microsoft, so you see we have about 120,000 users who've logged in recently. the chart below that shows you the unique user count over time. so you notice these five peaks correspond to the five work days, and the sort of little bumps correspond to the weekends. and just eyeballing the chart, you know that something went wrong over here, where you'd expect the fifth work day.

our first instinct was, let's go check the service. but it turns out that was the fifth of september, and all microsoft employees quite sensibly decided not to log in and check their email on that day. let's move on just a little bit. you see this bar chart over here that tracks all the incidents that have occurred in the recent past. and now, for the very first time, we're actually able to map an incident down to the customer level, and by that i mean down to the tenant level and even down to the individual user level.

we have a lot more to show at our service health dashboard session tomorrow, so come check out the latest and greatest on that front. so here you see the incidents that have actually impacted microsoft, when they've occurred, how long they've taken, etc., etc. so let's switch gears for a little bit and let's assume all of this is running just fine. everything's peachy, the service is humming along. so this is our steady state. so what is linx doing in the meanwhile? in the background,

linx is continuously assessing every single user. so even if the service availability is up at 100%, linx is revving in the background. and what we're doing here is gleaning a bunch of insights about our customers. so as an example over here, this chart shows you a set of client-side errors that we actually gather, so we'll walk through an example over here. let's pick on the dumpster operation exception,

and some of you may be able to relate to this. this is actually engineering speak for users hitting mailbox storage limits. another one that might resonate with you is the invalid license exception. we all know license management is messy, and we feel the pain too. that points to a bunch of users possibly logging in with invalid licenses, expired licenses, etc. so aside from errors and service health, linx is also trying to build this holistic, 360-degree view of the customer.

so we also are trying to understand what devices you guys like, what apps you use, what the status of your subscriptions is. so this is a view that shows us what the active subscriptions are and when they're gonna expire; that orange bar marks the expiration window. so you've seen a bunch of stuff on screen here, and now i'd like to just tie it all together and clarify the impact all of this makes for you.

so we started with the drop zone; the goal there is to reduce the time to resolve incidents, so that service availability is back up to at least three nines. then we talked about linx, which helps us actually validate your true experience. and we talked about the service impact mapping that's now powering the new and improved service health dashboard, which gives you better service insights. last but not least, i showed you a bunch of steady-state insights, and

i'd like to give you a quick peek into what this can do for you. again, this is a new offering called everyday insights, for the tenant admins. again, this is powered by the customer fabric. the goal here is to gather all these insights that we're actually computing and serve them back to the admin. we wanna empower the admin with a set of insights that truly lets them get the most out of office 365. ultimately, to save time, to save money, and just be more productive,

so you will now be able to tie the story together, because some of the insights that we wanna show over here are: say, three users in your organization are hitting mailbox storage limits, or you have a couple of users hitting license problems, and, last but not least, say you're down to 20 available exchange licenses. and so we're not just gonna throw this at you; what we wanna do is augment this with rich data. of course, we can point you to where you can go and buy more licenses, but we wanna do more than that.

we wanna actually hand you the sets of users, say, who never use the exchange service. you can possibly reassign licenses and just save money. with that, i'd like to hand the stage back to eddie to talk more about office. thanks. >> [applause] >> all right, we will move on to what i call the brains of our service management, something we affectionately call central admin. of course, we always joke that we are turning into robots,

as depicted by this picture. remember that the service is huge; we have hundreds of thousands of servers with [inaudible] bytes of user data that we're responsible for, running over an entire global network. as a result, we built this special-purpose system that ensures that all actions that are applied to the service are done in a safe manner. we do this by ensuring that the orchestration of these actions is itself robust, along with the workflows themselves.

in fact, everything is forced to become code, which then allows us to go test it, verify it, and then validate it with other workflows, all within a central place called central admin. this is how we protect the service and the user data, and essentially prevent accidents where our engineers might inadvertently harm the service. of course, as i said in my earlier comment, what most engineers don't realize is that there's a greater central authority at play. central admin itself is also a major service in its own right.

as a result, we also run it like any other production service. that means it has its own sla, metrics, and life cycle management that we track on a regular basis. we also look for poorly written workflows, and fix them or remove them from the service. let's talk about something that is dear to my heart, as my team is directly responsible for this. one of the most complex and frequently run workflows in the service is deployment, because deployment essentially

brings thousands of changes per week to the service. we needed a way to protect the service while we were updating it, so what we do is create multiple partitions in the service that we call rings. let's talk about ring 0 and ring 1. this is where i and the entire exchange team have our own mailboxes hosted. we believe that it is the best way, with a term called dogfooding, for us to directly experience all the changes that we put on ourselves. it is incredibly important that we treat it like production, but

it also enables us to move very, very fast. in fact, as you would have heard earlier, we go, in fact, multiple times a day, so that if we do something wrong, we can quickly apply a new fix by simply deploying it to ring 0 and ring 1. ring 2 represents the entire microsoft population. we deploy roughly once a week, and we coordinate with our own internal it operations to ensure that they're aware that we're deploying, and then, if they see anything, to let us know as quickly as possible. essentially, microsoft is our own last line of defense

before we roll out a change to the rest of production. and of course, we're also a little more careful with msit, since blowing up satya's mailbox is probably a fireable offense. ring 3 is what we call the sip ring. we also call it a slice in production. it is a small but reasonably large part of production which we go to after we've completed our msit deployment. we will rotate this if we end up missing a regression that had somehow escaped msit.

we want to make sure that we're not hurting customers in an unfair way. and last but not least, ring 4 represents the rest of our production, and it goes much slower, just like sip, only when we believe we're safe to go do so. it does not mean we always catch everything, and we'll talk a little bit about what we do when we do in fact have regressions.
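as a rough sketch of how a change moves through these rings, here is a minimal python illustration; the ring names follow the talk, while the gate function and the printouts are placeholders rather than our real deployment policy.

```python
# a toy version of the ring progression described above: dogfood first, then
# msit, then a slice of production, then everything else. the gating logic is
# a placeholder, not our real deployment policy.

RINGS = [
    "ring 0/1 (exchange team dogfood)",
    "ring 2 (all of microsoft / msit)",
    "ring 3 (sip, a slice in production)",
    "ring 4 (rest of production)",
]


def roll_out(change, ring):
    print(f"deploying {change} to {ring}")


def deploy(change, healthy_after):
    """healthy_after(ring, change) -> bool stands in for the real regression
    signals (red alerts, msit feedback, and so on)."""
    for ring in RINGS:
        roll_out(change, ring)
        if not healthy_after(ring, change):
            print(f"regression detected in {ring}; stop and fix before going wider")
            return False  # never let a regression reach the later rings
    return True


# example: a healthy-looking change flows all the way out to ring 4
deploy("build 1234", healthy_after=lambda ring, change: True)
```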

one of the key things i do wanna talk about is how we manage our hardware. in fact, if you're used to running services, you will know that typically you see about a 10 to 15% hardware failure rate. when you have hundreds of thousands of servers, that's a lot of hardware, especially given the different kinds of problems you run into. and in fact, this chart shows you the breakdown and the key major areas that require fixing. as you can see, if you look at hard disk drives, and power supplies, and fans, essentially things that move break often. in fact, if you remember earlier, we talked about the logjam

that we saw in getting servers up and running. well, it turns out that hard drives are kinda like loaves of bread: if you leave them out for too long, they get stale and moldy. and in fact, what happened was the servers had been sitting on the dock for so long that, by the time we were ready to stand them up, the drives were failing at an incredibly high rate, and it further caused more logjams in the service. that is a lot of hardware issues to go resolve. well, we created a specialized central admin workflow that we call repairbox.

it is the reason why we actually don't see that 10 to 15% failure rate. in fact, we consistently see only about 1% that our engineering team has to deal with. this particular workflow basically goes around the world, checks machines, tests them, and may proactively pull them out of rotation, in order to avoid them becoming a roadblock when deployment is about to begin.
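here is a minimal sketch of that kind of proactive repair loop, with hypothetical check names and machine records; the real repairbox workflow runs inside central admin and is far more involved.

```python
# a toy repairbox-style sweep: test each machine, and proactively pull suspect
# ones out of rotation so they can't become a roadblock for deployment. the
# health checks and machine records here are hypothetical.

def sweep(machines, checks):
    """machines: dicts like {"name": ..., "in_rotation": True, ...}.
    checks: functions machine -> bool, True meaning the check passed."""
    pulled = []
    for machine in machines:
        if not machine["in_rotation"]:
            continue  # already out for repair or upgrade
        if not all(check(machine) for check in checks):
            machine["in_rotation"] = False  # hand it off to the repair queue
            pulled.append(machine["name"])
    return pulled


# made-up fleet: one machine fails its disk check and gets pulled
fleet = [
    {"name": "server-01", "in_rotation": True, "disk_ok": True},
    {"name": "server-02", "in_rotation": True, "disk_ok": False},
]
print(sweep(fleet, checks=[lambda m: m["disk_ok"]]))  # ['server-02']
```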

in fact, we have done work where we've tied deployment workflows to repairbox so that they work hand in hand in terms of isolating and identifying bad machines and taking them out of rotation. and no, this is not a bunch of monkeys that we feed bananas, running around the data center fixing things. on that note, i would like to welcome mike softword to talk a little bit about what he does for the service. thank you, mike. >> [applause] >> here you go. >> thank you.

>> let's get going here. so i'm gonna talk a little bit about what on call is like, if the computer cooperates with me. all right, here we go. all right, so you've seen osp. you've seen some of the other tools that we use for on call, and i'm gonna talk about a specific type of on-call rotation, and that is the incident manager. most on-call engineers spend their time

working on the components that they own, the code that they've written. makes a lot of sense: if you write it, you know how to fix it. well, we have another rotation called the incident manager, which is typically management, people who have been around a while, who are pretty good at putting out fires, or at the very least figuring out who to call in the middle of the night. so i thought it would be interesting to talk about my last on-call rotation, which was last month. i'll do a very condensed version of it.

we can't possibly, in five minutes, go through debugging the entire service, but i'll give you a flavor for the tools that we use, and what actually happened with this specific incident. so to set the stage, this was the second week in august. it was right at the end of the two weeks of summer that we have in seattle. so, not coincidentally, we had a deployment freeze for two weeks, because we wanted to make sure that we got as much sun as we could,

give the deployment team a little bit of time off, and i thought that this would be great. we weren't putting new builds out, nothing was being primed or going through the rings. we weren't going to have a lot of churn going on, and i would have a nice easy week. no such luck. it turns out, i came on call monday morning, and almost immediately, i get an alert.

so this is what an alert looks like. it's a mail with a lot of information, big, scary red text. if you scroll down, you see that it tells you what's going on, tells you what the probes are saying. and it's for a very small set, just a dag, telling you what's going on with that dag. it tells you which protocols are failing. it tells you which component is broken. it's karim's component, so he broke something.

and what that actually means is not that no one can talk to the server, but that 5% of the time, the requests are failing, and so we wanna understand why. it tells you which servers are failing, and it turns out there's one server that is failing much more than any of the other servers, so that's probably where the problem is. and it tells you what it would like to do. it'd like to restart that backend server, because it thinks that will fix the issue.

and then it's so helpful that it goes on to tell you, three times, why it didn't actually go through with that auto-restart action and decided to call you instead to investigate. so if you look at the error, the error says global throttling: global limit for auto recovery by red alert per hour. that's really odd, i hadn't seen that before. that suggests that we're doing a lot of reboots. and we like to have the systems automatically recover. but on the flip side, we don't wanna have them automatically recover and

reboot every machine in the entire service at once, and cause the lights to dim on the eastern seaboard. so we decided to put some hard limits in place to make sure that we didn't reboot too many machines. we have a report that tells us what's going on. so here's the generic report, and let's drill in and see what's going on with that day. and this is the report from that day. and as you can see,

this green line here is the number of reboots per day. it kind of goes up and down, up and down. and you see here, this is right when the deployment freeze started, when we went into the deployment freeze. so it was about the time that we would start rebooting machines or rebuilding machines. and we see the reboots creeping up. now that's very interesting. so over the course of a week, it started creeping up.

then you can see, this is exactly when i went on call, and then it just shot right up. so i have a huge number of reboots going on. well, that's also very interesting. so let's see if we can figure out what's going on with all of these reboots. all right, i have to go over here. so, karim showed you earlier one of these tools that has performance information in it,

and i'm gonna show you a picture of that tool, since it's kinda hard to go back in time. so this is using the same data that karim showed you, except it's a perfmon-style view. and the way to think of it is, it's like perfmon, except it's perfmon with a dvr for all the machines in the service. so you can go back in time to a specific point in time, and take a look at what's happening. so i've helpfully annotated this.

so i'm looking at a couple of things here. one, i'm looking at these massive spikes. the bolded line is the number of threads in the replication process. so that means that something happened to cause it to have to spin up a bunch of threads, and really, the only thing that can cause that is massive latency in i/o. well, if we also look at this purple line, the purple line is kernel cpu. the kernel cpu spiked massively at the same time.

so that's also interesting. that suggests that the kernel was doing something that was blocking up io and causing latency. and if you look at this blue line that goes ever increasing up, that's the number of seconds that the server has been up, and you can see that this has been up about 12 days. so that's pretty interesting. so what we probably wanna do is we probably wanna take a step further and try to debug what's going on.

let's see if i can... my mousing skills are... aha, let's go to the backup screenshot. so i'll show you a screenshot of what we found by investigating this. if we look at what's going on, we will see what the kernel is up to, and that will give us an idea of what's actually going on. so here's the call stack that we found by grabbing an etw trace and looking at it in windows performance analyzer. well, it tells us what the kernel's doing, and we see something interesting.

we see that it's extending paging files. and so what this means is it's trying to grow the page file on the system drive. this is causing ntfs to convoy and hold up all of the io for all of the rest of the applications, even on other drives. well, that's very interesting: why is it doing that, and why is it even growing the page file? so i have another screenshot that will show us what the configuration is on this particular machine.

so here's a screenshot; i'm sure you've seen this dialog before. so it says that we have a custom size for the page file. the minimum is set to the maximum, and in windows speak that really means your page file size should be fixed, should be constant. but if i look at the current size of the page file, it's significantly below the size that we've set it to. that's really interesting. that shouldn't happen.

if we've set a fixed page file, it shouldn't be growing, and it certainly shouldn't be smaller than what we've set it to. so, since we all work at microsoft and we're all engineers and we do a pretty good job of looking at things ourselves, we can actually just go ask the windows devs what's going on. so we called them up and we said, what's going on? we don't understand this. and there's a lot of back and forth, a lot of discussions. but it ended like many tech support calls do, where they said, did you reboot the machine?

does rebooting the machine solve the problem? well, it turns out that rebooting the machine does solve the problem, because this is a piece of configuration that requires a reboot. and in our infinite wisdom, we'd optimized something at some point in time and moved this piece of configuration to after the last reboot, and so these machines had not been rebooted for a week and a half after setup. they had their default page file, which had to grow, and after they were up for

a bit of time, the page file needed to grow, and when it did, they hit this io contention problem, which caused these little hiccups. well, now that we knew what the problem was, the fix was pretty straightforward. the first thing we do is we up the limit on the auto-reboots. we up the global limits, but we don't up the per-dag limits. we don't want the databases flopping around on the dags, but we do want to have more reboots happen globally, so that people aren't paged to manually go reboot machines.
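the shape of that throttle is simple enough to sketch: two counters instead of one, a per-dag limit so databases don't flop around, and a global limit so the whole service can't reboot itself at once. the limits below are made up for illustration.

```python
from collections import Counter

# a toy version of the two-level reboot throttle described above. the numeric
# limits are made up for illustration, not the production values.

GLOBAL_REBOOTS_PER_HOUR = 200   # raised first, so humans stop getting paged
PER_DAG_REBOOTS_PER_HOUR = 2    # kept low, so databases don't flop around a dag

global_count = 0
per_dag_count = Counter()


def may_auto_reboot(dag):
    """return True if an automatic recovery reboot is allowed right now."""
    global global_count
    if global_count >= GLOBAL_REBOOTS_PER_HOUR:
        return False  # the "global limit for auto recovery by red alert per hour"
    if per_dag_count[dag] >= PER_DAG_REBOOTS_PER_HOUR:
        return False
    global_count += 1
    per_dag_count[dag] += 1
    return True


# the third reboot request for the same dag within the hour is refused
print([may_auto_reboot("namprod00dg5") for _ in range(3)])  # [True, True, False]
```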

then the next thing we do is we move the configuration setting to before the reboot. and so the final thing to look at is a picture of what it looks like now, so let's go pull up and run a report... and it's gonna make us wait. let's see how happy it is with these results. so this is a live result. let's see, here are the reboots. here's where i went on call, and this is when i went off call. and all the reboots were done.

and the next guy, who happened to be karim, had a nice easy week and didn't have to deal with this particular problem. so that's a very condensed version of what a week of an incident manager is like. it took a little bit longer than five minutes. and much like this demo, it was really fueled by caffeine and adrenaline, rather than just five quick minutes of poking around on the computer. and with that, i'll turn it back to eddie so he can wrap things up.

>> [applause] >> thank you, mike. on that note, we're gonna go through a new topic. so far we've talked about how we run the service at scale, but i wanna talk a little bit about how we improve the product and the service itself through additional systems and processes. this is a super-important slide. this is something that's drilled into every engineer from the moment they start boot camp on our team.

it is what we refer to as a service culture. it enables us to focus on what's important and to ensure we constantly improve on things over time. in fact, we hold what we call a weekly service review that focuses on the previous week's incidents and the key learnings, and then we apply them as quickly as possible. one of the more important reviews we also do, with our corporate vice presidents every month, is this thing called the monthly service review.

they remind us what the bar is, and how we are supposed to work. and of course, they're constantly raising the bar on us. and as you may recall from the previous talks, perry clark is our corporate vice president. he is notorious for having a perfect memory. he will remember when you've made the same mistake twice. he is someone we deeply admire and fear at the same time. and this is a picture of him. so, how do we improve the product itself?

our service is made up of millions of lines of code. it is incredibly complex; it is even larger than the service fabric itself. well, we could do inspections of code, but what we really do is let the service guide us. and what i mean by that is, with the regressions. as you recall from earlier, we have this thing called rings, and we're trying to find as many regressions as we can as early as possible. and every time we have a regression, we want to learn as much as we can

in order to improve our product, including our engineering as well as our processes. so there are two key parts that i will talk about. the first is the engineer. we focus on: how did the regression get into the system? we have multiple systems that essentially measure things like code coverage, performance, security, and other code-related health checks. all this feeds back to the engineer directly in the place where they write code, typically inside visual studio, in the form of hints and

suggestions directly as they're writing. this enables us to make sure that every time we hit a regression, we are making every engineer smarter and smarter by baking that learning into their environment. in addition, we also have a set of loosely connected systems that we refer to as the engineering system. one of them is actually responsible for running the hundreds of thousands of tests pretty much all the time on the code, before it even gets to production.

so every time we see a regression, we look for gaps in our tests and we fill them. one of the running jokes we have is we often tell engineers, yoda is helping you. one of the other ways that we help to improve the service is, in some cases, we run what we call war game exercises, or monthly drills. we have a team called the red team, whose job is to essentially look for vulnerabilities in the service. they don't tell anybody they're doing this.

their job is in fact to not tell anyone. of course, they do this with a very clear set of rules of engagement: no destructive testing, focus only on microsoft. in fact, recently one of their team members managed to find a new security hole inside one of karim's areas, and essentially took one of the keys and embarrassed one of our corporate vice presidents. needless to say, karim was not impressed. we also offer this bug bounty program, which some of you

may be aware of, where we essentially let others outside of the service help us find additional vulnerabilities. we have such a massive surface area that all eyes and any more hands are gonna be helpful. we also have a blue team that works to find intruders, look at data, find new ways to detect such intrusions, and protect the service. we will then occasionally simulate drills with and without each other as a way to invest in improving our process.

we have a series of talks in the rest of the sessions at ignite that focus on security. i highly invite you to go check them out. sara manning dawson is a gpm who's a peer of mine, and she's heading up a few of those talks. you may have noticed, hey, there's another service provider that does kind of something similar. in fact, for those who are familiar with it, they ended up calling their work chaos monkey.

essentially, we have our own version. all right, on that note, i hope everyone has enjoyed the talk. it has been a tremendous pleasure for me and my team to come here and talk to you about the work we do. we look forward to any and all feedback, as well as any questions you may have. and one last thing before we go: please visit the exchange booth. we're looking for customers to sign up for our tap programs.

on that note, thank you very much. >> [applause]