EVE Information Portal
New Dev Blog: TQ Move Outage Details
 
This thread is older than 90 days and has been locked due to inactivity.


 
Pages: 1 [2] 3 4 5


Camios
Minmatar
Sebiestor Tribe
Posted - 2010.06.30 15:33:00 - [31]
 

Next time be more cautious!

Go on with the good work.

perix
Minmatar
Sane Industries Inc.
Initiative Mercenaries
Posted - 2010.06.30 15:40:00 - [32]
 

Thank you for the information and update. I find it interesting to read about an environment as complex as Tranquility.

CCP Yokai

Posted - 2010.06.30 15:50:00 - [33]
 

Edited by: CCP Yokai on 30/06/2010 15:53:25
PICS!!!

Overhead before cable

Cables

Connecting and testing each and every one of the Ethernet ports

Cleaned up

Overhead view of the cabinets with a look at the air containment transparent tiles

We blew out a shoe

The team for this trip... (not the move team, but one of two prep trips)
CCP cNOC
CCP Mindstar
CCP Yokai
CCP Zirnitra

This is all we have for now... all the pre-move prep work. We'll do the post-move pics next month, after we finish migrating the non-TQ items.

Camios
Minmatar
Sebiestor Tribe
Posted - 2010.06.30 15:55:00 - [34]
 

Edited by: Camios on 30/06/2010 15:59:58
You were grinning, which means this photo was taken before the mess.

cool pics btw

teko82
Caldari
Mark Of Chaos
Posted - 2010.06.30 16:06:00 - [35]
 

I keep wondering about the whining every time there is a big change and a longer DT than expected. By now most players should know that you always put on a long skill for these days. And I do hope that people have other things in their lives they can do during these DTs. Most PCs have solitaire :P

The company I work for never has issues this big that take this long to fix, but with around 200K employees they have a lot of backup measures to secure every change. On the other hand, every time there is a larger change in any of the systems we use, it will normally be running as intended around 2 months after the launch; in EVE it is a week at most :D. So this is nothing in my opinion ;) And nice pics, I feel like my PCs need to be upgraded :lol:

Ix Forres
Caldari
Righteous Chaps
Posted - 2010.06.30 16:21:00 - [36]
 

I'd just like to echo those congratulating you; I can only imagine how nasty that night/day must've been for you all personally, and I think everyone appreciates the dedicated work along with your diligence in putting the integrity of the game first, and not rushing ahead.

On an entirely unrelated note, how about a complete list of hardware and the network topology? I know a lot of people would be very interested - I know the basics are out there, but some of us are truly geeky and love to hear about the architecture, infrastructure (right down to power management and cooling systems) and so on!

Batolemaeus
Caldari
Free-Space-Ranger
Morsus Mihi
Posted - 2010.06.30 16:21:00 - [37]
 

Originally by: CCP Yokai

PICS!!!


Thank you, the thread is now complete.

Pwnzorator
Posted - 2010.06.30 16:26:00 - [38]
 

Originally by: CCP Yokai

We blew out a shoe



I've seen holes like that before! Someone stood on one of the sticky-up floor bolts in a rack?

Overall, that's a really nice, neat install job. Better than most of the cable monsters I've had the misfortune to work with.

Commander Azrael
Red Federation
Posted - 2010.06.30 16:29:00 - [39]
 

That's awesome, love the pics :). Kudos to CCP for getting the cluster back up and running after a full suite move, which is a royal pain in the ass! We consolidated 3 of our suites about a year ago into 1 big suite with 120 racks and that was an absolute nightmare! So I feel your pain :)

Moar pics! :oops:

</geek mode>

Cinori Aluben
Minmatar
Gladiators of Rage
Intrepid Crossing
Posted - 2010.06.30 16:30:00 - [40]
 

Edited by: Cinori Aluben on 30/06/2010 16:31:02
CCP Yokai, you are the man. I'm continuing to like you more and more. And nice pics! (Btw how did that shoe get blown out? And was the foot inside unharmed? lol)

Couple things I appreciated from your blog:
Quote:
Despite rumors and criticisms to the contrary, our plan included a significant time buffer for the work.
Glad to know this, even though it unfortunately took even longer. I was called unprofessional for suggesting a buffer, but you showed wisdom in figuring in a time buffer.
Quote:
VIP Mode is when Tranquility is up, but accessible to CCP staff only (many of you noticed and were curious about why 30+ others were on TQ while you couldn't login).
Hook a brotha up with VIP invite yo! 8)
Quote:
We all really appreciate the understanding and kind words... and even the harsh ones we needed to hear.
You are a humble, good dev. Keep listening, and keep learning, you can never stop learning. Many at CCP could learn from your approach here, and I hope you get credit for it.

I look forward to your responses to all the great IT questions in these comments, and to future server improvements I know you've got coming up the pipe :)


Dismas Ofstedal
Minmatar
Dead Pilots Society
Chaos Theory Alliance
Posted - 2010.06.30 16:40:00 - [41]
 

Personal anecdote: Waaay back, when I was a lowly hamster keeping the reel-to-reel tape drives loaded I watched as the techies tried to debug a hardware issue or two on a new mainframe gadget - they had circuit boards strewn all over the floor. They went home to sleep and look at the problem with fresh eyes in the morning. Sometime during the night, a janitor picked up all the cards and they got put through the garbage crusher. ugh

Seriously, this outage was like the aftermath of a burrito binge. You think the world is ending at the time, but a few days later, it's just a dim memory and you're jonesing for more burritos.

It was a minor inconvenience, just a tiny part of the bigger adventure, and well compensated. I've had poorer service from my former bank.

Thanks CCP.

Amida Ta
German Mining and Manufacture Corp.
Posted - 2010.06.30 16:56:00 - [42]
 

So the big question remains unanswered in the blog:

"after finding the root cause"

So what was the root cause?

Tsabrock
Gallente
Circle of Friends
Posted - 2010.06.30 17:04:00 - [43]
 

I had planned for an extended downtime, as being a small-time techie myself I know how much longer things can take to fix. Maybe the Mr. Scott method of time estimation is in order?

I am greatly curious, how long do backups of the TQ database take? How many GB of data are ye dealing with? I know it's vastly more than anything I do with my own service & repair, and have always wondered.

RedClaws
Amarr
Macabre Votum
Morsus Mihi
Posted - 2010.06.30 17:16:00 - [44]
 

Originally by: Amida Ta
So the big question remains unanswered in the blog:

"after finding the root cause"

So what was the root cause?


root beer :lol:

Nice pictures, we do love pictures of geeky computery stuff

Dead Cheese
Gallente
Fat Kids
Posted - 2010.06.30 17:22:00 - [45]
 

Edited by: Dead Cheese on 30/06/2010 17:25:00
As a professional geek myself, I too am a little perplexed at the seeming lack of a special non-scheduled backup. A full DB dump should have been done after the app cluster was shut down, but before the DB cluster was shut down. Transaction logs are only there to save you in the event of an unplanned outage - to catch you up to your current dataset from last night's backup. We shouldn't even be discussing transaction logs with a major planned change such as this.

Furthermore, what make/model of SAN are you running? All major SAN vendors have a snapshot capability. If a snapshot of the DB volumes had been taken just before the suggested DB backup, the DB would have been back up in five minutes. Then the only real downtime would be due to your lengthy testing procedures (which is a superior move, BTW).
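
To make the ordering argued for above concrete, here is a minimal sketch of a pre-move checklist validator. The step names are hypothetical and purely illustrative; nothing here reflects CCP's actual runbook or tooling. It only checks that a snapshot and a full dump sit between the app cluster shutdown and the DB cluster shutdown.

```python
# Hypothetical step names; none of these come from CCP's actual procedures.
REQUIRED_ORDER = [
    "stop_app_cluster",     # game servers down, so nothing writes to the DB
    "snapshot_db_volumes",  # SAN-level snapshot for a fast rollback
    "full_db_dump",         # full backup that does not rely on transaction logs
    "stop_db_cluster",      # only now shut the database down for the move
]


def validate_plan(steps):
    """Return a list of problems found in a planned shutdown sequence."""
    problems = []

    # Every required step has to appear somewhere in the plan.
    for name in REQUIRED_ORDER:
        if name not in steps:
            problems.append(f"missing step: {name}")

    # The required steps that are present must appear in the required order.
    present = [s for s in steps if s in REQUIRED_ORDER]
    expected = [s for s in REQUIRED_ORDER if s in steps]
    if present != expected:
        problems.append("steps out of order: " + " -> ".join(present))

    return problems


if __name__ == "__main__":
    plan = ["notify_players", "stop_app_cluster", "stop_db_cluster"]
    for problem in validate_plan(plan):
        print(problem)
```

Run against a plan that goes straight from stopping the app cluster to stopping the DB cluster, it reports the snapshot and the full dump as missing, which is the gap being asked about.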

I'm so confused. More details for the geeks please!

Jowen Datloran
Caldari
Science and Trade Institute
Posted - 2010.06.30 17:59:00 - [46]
 

Nice blog and nice pictures.

I actually spent much of the time when Tranquility was down playing EVE... on Singularity.

Apparently few realized that Sisi was up and running for most of the Tranq downtime. I went and explored some places in 0.0 and wormhole space that I otherwise probably would never get to see.

Yalluto
Gallente
Ascenda Group
Free EVE Alliance
Posted - 2010.06.30 18:05:00 - [47]
 

I think the point brought up by a few, that transaction logs should be irrelevant as a matter of course and that a full backup should instead be performed as a non-restarting downtime job, is spot on. I would have thought that a problem with the SAN on its initial production run/move causing data corruption would have been as simple as isolating and fixing the cause, and then wiping and slapping on the backup.

Beyond that, what everybody is missing here is the number of integrity checks that were mentioned and how long they can take. Everybody needs to acknowledge that this step, which is paramount to us all, is a time-consuming process. And the better the integrity checks (it sounds like CCP has put a lot of effort into integrity check procedures), the better job they will do.

Ah, and for querry, I'll bet CCP loves the setup they have of imaging the nodes for bringing them back up, which prevents a lot of individual machine teching around.

RentableMuffin
Posted - 2010.06.30 18:33:00 - [48]
 

Burrito buying spree.... STOP SPYING ON ME!!!!

Wynteryth Fett
Posted - 2010.06.30 18:42:00 - [49]
 

I'd like to thank CCP Yokai for the detailed account of what happened during the downtime.

In the other comments thread, I'd asked these 4 questions.

Quote:
1) Why wasn't the new equipment and one system set up weeks in advance to prevent this extended down time?
2) Has CCP taken the precautions necessary to ensure the "database errors" *wink wink* don't re-occur going forward?
3) Does the new server location have the infrastructure to ensure that the system doesn't go down due to something minor like a bad PDU or short-term loss of power?
4) Are the servers with the new equipment set up in at least a 2N redundancy so something as simple as hardware failure doesn't shut the game down?


As I mentioned in that other thread, my background on this is mostly from a disaster recovery standpoint, but I know that the concepts still hold true.

It's good to know that you all had much of the networking items set up ahead of time. I'd like to ask if the servers are set up in a 2N redundant system. If they are, why wasn't only one of the systems moved ahead of time? If the servers aren't in a 2N redundant system, why not?

Has the cause of the Database errors been determined? Could they have been found ahead of time with a 2N redundant system?

Also, is there one GIANT database for everything, or are there multiple ones (one for planets/stations, one for items, one for player info)? Without giving away any potentially proprietary CCP information, could you tell us if there is one giant database or several smaller ones that link to it?


Jackie Fisher
Syrkos Technologies
Joint Venture Conglomerate
Posted - 2010.06.30 19:19:00 - [50]
 

Edited by: Jackie Fisher on 30/06/2010 19:21:15
Originally by: CCP Yokai

Connecting and testing each and every one of the Ethernet ports


Solved your problem for you - there is a tramp living in your kit. Don't buy any Big Issues from him or he'll never leave.

Dillon Arklight
Aliastra
Posted - 2010.06.30 19:25:00 - [51]
 

Originally by: Jackie Fisher
Edited by: Jackie Fisher on 30/06/2010 19:21:15
Originally by: CCP Yokai

Connecting and testing each and every one of the Ethernet ports



Solved your problem for you - there is a tramp living in your kit. Don't buy any Big Issues from him or he'll never leave.



:D :D :D

Thanks for the AAR, just goes to show how much work goes into keeping TQ healthy.

Yuda Mann
Posted - 2010.06.30 19:45:00 - [52]
 

Originally by: Tanjia Guileless
"What are we doing to prevent this?"

Migrating to a serious database product?


Leave that poor dead horse alone. If you think MSSQL isn't a serious database product then I invite you to come back to this universe instead of the one you're floating in. At the time the devs were looking at db's, MySQL sucked for massive projects and couldn't do the things they needed. At that time as well, MSSQL was the best option.

The devs have already explained why they'd never switch as well. You might notice that it's pretty much basic common sense too. Then again, you might not.

http://www.eveonline.com/iNgameboard.asp?a=topic&threadID=1095044&page=5#123

The answer to this question is simple, the cost of redevelopment is huge. You can't imagine the amount of code we have in T-SQL. It will always be cheaper to buy more hardware than reinvest in MySQL, an investment that may or may not give you some performance. You can't know if it will give you performance benefits until you have been able to make EVE work on it.

That is why we have not even considered changing our database platform. That's as simple as that. We are also very happy with SQL Server.

As I said before I don't like platform debates, so I will probably not post more in this thread, as I don't want to be pulled in.
----
Senior Virtual World Database Administrator
Operations department
CCP Games

Sturmwolke
Posted - 2010.06.30 20:45:00 - [53]
 

Edited by: Sturmwolke on 30/06/2010 21:14:49

Originally by: Dead Cheese
As a professional geek myself, I too am a little perplexed at the seeming lack of a special non-scheduled backup. A full DB dump should have been done after the app cluster was shut down, but before the DB cluster was shut down.


This, tbh. A failed transfer/corruption scenario should have been anticipated and included in a failover/recovery plan for this move. If what the blog says is true and a SAN failure caused DB issues, that would be classified as a failed transfer/corruption.

Did you guys really have such plans ready before initiating the move? Fess up. It's almost as if you were doing it on the fly and crossing your fingers that nothing would screw up.

P.S. This looks more like a process failure than any real technical failure to me. Either a failure to anticipate a failure scenario (which really is a no-brainer) or undue risk-taking on the planning part.

Edit:clarity

Xornicon Altair
Woopatang
Primary.
Posted - 2010.06.30 21:09:00 - [54]
 

Stop using Microsoft to run your database. Not only will you have fewer issues with the database, but performance will increase by not requiring the ridiculous overhead that Microsoft operating systems demand. Just my opinion, but there are so many better options out there than MS.

Dalilus
Posted - 2010.06.30 21:18:00 - [55]
 

Let's use a fishing metaphor... You and your friends plan a fishing trip months in advance. Since you know you will be raked over burning coals, tarred and feathered, not to mention severely ridiculed, if your boat does not work properly, a mechanic is hired to go over the engines, transmissions, etc., making sure everything is shipshape. That faulty RPM gauge is replaced, battery charger and batteries checked, fuel lines and tanks cleaned, oil and air filters changed, toilet flushed a few times to make sure it works, paperwork and fishing permit triple-checked, safety equipment and radio on board and working. You are set.

Ideally you would go fishing with friends, spending 8 - 10 hours on the water and getting a ton of fish, the beginnings of a sunburn, hours of video, tons of photographs, chugging cases of beer, emptying many bladders, eating sushi, sandwiches and ceviche, swimming with the fish, all in all a great fishing day. Back at the dock you or your boat boy would clean the boat and tackle, loading fuel and bait for the next day's fishing, checking the hot engines to make sure everything is in order, and finally catching up with your friends to wash up and party all night at the local drinking hole.

Instead you wake up early to go and check oil and water levels before your friends show up. When you try to fire up the engines, you find out that you and your friends cannot go out because the manifold on one of the engines is cracked and it takes a day or two to get it repaired/replaced. Let the roasting begin as you scramble to find a rental, on your nickel, while your pride and joy is being pulled out of the water to sit on the hill while it is repaired.

IMO CCP did good.

Adonais Templar
Minmatar
DemSal Corporation
DemSal Unlimited
Posted - 2010.06.30 22:18:00 - [56]
 

Having worked in the industry, I have a good idea what it took to make the move. Must say good job; I personally thought the downtime was an underestimation. Kudos on fixing the DB problem so fast, databases are probably the worst problem to fix. The bonuses from the longer than expected downtime were worth it. Planning any long downtimes again? ;-)

Knaar
Posted - 2010.06.30 22:53:00 - [57]
 

Edited by: Knaar on 30/06/2010 22:53:40
Originally by: Xornicon Altair
Stop using Microsoft to run your database. Not only will you have less issues with the database, but, performance will increase by not requiring the ridiculous overhead that Microsoft Operating Systems demand. Just my opinion, but, there are so many better options out there than MS.


http://en.wikipedia.org/wiki/Vendor_lock-in

Glowstix
Broski Enterprises
Posted - 2010.06.30 22:53:00 - [58]
 

Thanks for another top-notch job, CCP. You guys put up with a TON of (mostly undeserved) abuse, and still came through with a fix and then shared with us in detail what happened and what you did right after. Not only do you go further than other companies who just leave it at "downtime took longer than expected due to technical issues", but you also explained how you chose to do a much more detailed and meticulous fix as opposed to taking the quick and easy way of just rolling back. Even when people are screaming to the heavens, and some are being quite childish, you guys are patient and try to work out the best solution for us.

You guys rock.

<3

Knaar
Posted - 2010.06.30 23:14:00 - [59]
 

I just wanted to say that you guys are doing an awesome job. Despite coming up against Finagle's Law you made the right decisions.

One thing you should do is realize that every whiner is actually a hopelessly addicted customer that needs his/her fix and gets super grumpy without it. We wouldn't be hopelessly addicted unless you all were doing something severely wonderful. So in reality every whine is just an admission of your magnificence in disguise.

Zathi Shaitan
Illiteracy Combatants
Posted - 2010.06.30 23:34:00 - [60]
 

MSSQL always was a fail cascade, is still a fail cascade, and will continue being a fail cascade.

