EVE General Discussion
Server Outage
 
This thread is older than 90 days and has been locked due to inactivity.


 
Pages: 1 2 3 [4] 5

Author Topic

Yur mom
Posted - 2008.10.30 03:44:00 - [91]
 

Originally by: Triffon
Originally by: AdioHyperion
Originally by: Generic Alt
Originally by: Ghost Goat
guess the old "lets reboot the cluster and hope it works"
didnt work :'(

i was soooo close to change a skill ,give us the damn que allready .


I was literally 5 mins away from skill train completion when I got booted.

I agree... give us the skill queue, please! :D


+1

+1

+.86 (stacking penalty)

CCP Valar

Posted - 2008.10.30 03:45:00 - [92]
 

This is the real deal. Server will be up in a few minutes, possibly a bit laggy while the DB server cache is rebuilt.

I'll post a little post-mortem here in a few.

Nacho Muchos
Posted - 2008.10.30 03:45:00 - [93]
 

Originally by: AdioHyperion
Originally by: Generic Alt
Originally by: Ghost Goat
guess the old "lets reboot the cluster and hope it works"
didnt work :'(

i was soooo close to change a skill ,give us the damn que allready .


I was literally 5 mins away from skill train completion when I got booted.


I agree... give us the skill queue, please! :D


+1



I was 15 minutes from finishing a skill, and I was warping into a mission when it went down. I sure hope my ship is still there when I login, and yes we definitely need some sort of skill queue for things like this, even if it is only one skill in advance. I was just about to give up and go to sleep but that would cost me another 6 or so hours of skill training, which is pretty crucial when you are training a new character. >.>

Ghost Goat
Posted - 2008.10.30 03:45:00 - [94]
 

Originally by: Generic Alt
Originally by: Ghost Goat
guess the old "lets reboot the cluster and hope it works"
didnt work :'(

i was soooo close to change a skill ,give us the damn que allready .


I was literally 5 mins away from skill train completion when I got booted.

I agree... give us the skill queue, please! :D


5 mins ! man i was hoovering over the skill i wanted to switch to,clicked it and
was about to go to sleep when i noticed it didnt changed and a sec later
i got disconnected .

AdioHyperion
Caldari
Ships N Stones
Posted - 2008.10.30 03:45:00 - [95]
 

Originally by: CCP Valar
I'll post a little post-mortem here in a few.


Do what?

Nik W
Posted - 2008.10.30 03:45:00 - [96]
 

Originally by: Triffon
Originally by: AdioHyperion
Originally by: Generic Alt
Originally by: Ghost Goat
guess the old "lets reboot the cluster and hope it works"
didnt work :'(

i was soooo close to change a skill ,give us the damn que allready .


I was literally 5 mins away from skill train completion when I got booted.

I agree... give us the skill queue, please! :D


+1

+1

+1

commander tennder
Posted - 2008.10.30 03:45:00 - [97]
 

5 sec

Zach Forrester
Posted - 2008.10.30 03:46:00 - [98]
 

Originally by: CCP Valar
This is the real deal. Server will be up in a few minutes, possibly a bit laggy while the DB server cache is rebuilt.

I'll post a little post-mortem here in a few.


Again, I apologise if I had anything to do with this. XD I swear I'm paranoid now. A hundred angry mercenaries will come to pod me, I'm sure of it. *panicpanicpanic*

Otho Underhill
Posted - 2008.10.30 03:46:00 - [99]
 

In

Super Skulls
Posted - 2008.10.30 03:46:00 - [100]
 

zomg it worx!!!!

Tukaa
Amarr
Bound And Determined
Posted - 2008.10.30 03:46:00 - [101]
 

ONLINE

Mankirks Wife
Caldari
Deep Core Mining Inc.
Posted - 2008.10.30 03:46:00 - [102]
 

Originally by: Yur mom

+.86 (stacking penalty)


Due to balance issues, we have seen fit to nerf pyramid quotes.

AdioHyperion
Caldari
Ships N Stones
Posted - 2008.10.30 03:47:00 - [103]
 

So far so good....

Cygnus DivumExuro
Gallente
Booming Industry Operations
Knights Collective
Posted - 2008.10.30 03:47:00 - [104]
 

bet folks are not thinking those IBM Blade Servers are not as cool as first thought!

Taius Pax
Posted - 2008.10.30 03:47:00 - [105]
 

woot! it's back! :D

sp009
Caldari
Organization Outcast
Posted - 2008.10.30 03:47:00 - [106]
 

The Servers up everyone togather log in n see if it is really fix'd Cool

Taius Pax
Posted - 2008.10.30 03:48:00 - [107]
 

Originally by: Cygnus DivumExuro
bet folks are not thinking those IBM Blade Servers are not as cool as first thought!

was that double negative intentional? :-O

Noriko Sakai
Gallente
DC1 Coalition
Posted - 2008.10.30 03:48:00 - [108]
 

Server up ;-)

Saardinen
Posted - 2008.10.30 03:49:00 - [109]
 

I'm just waiting for the postmortem. *nailbites*

Dinsdale Pirannha
Gallente
Posted - 2008.10.30 03:51:00 - [110]
 

I have worked for a large company (think 3 letters) and we hosted many, many customers' web sites. Very rarely, we would have outages. All hell would break loose, even when we did not own the software/hardware/network.

In the case of the problem being traced back to something my company did, somebody would get fired. It is simply incomprehensible to me that ANY business would mess with their active database during regular business hours (that is the 23 hours/day in this case). And trust me, the only way CCP had a failure was if something WAS CHANGED.

You want to make changes that can't be done in the one hour you have allocated every day, you SCHEDULE A CHANGE WINDOW. You give everyone at least a week's heads up, and you bring down the server at a time which impacts the fewest people.

But these ad hoc outages are ridiculous, and would never happen if CCP actually followed some kind of discipline that every other company follows where its customers need to access information online.

Can you imagine what would happen if your bank decided to upgrade their database at 2:00 pm on a Wednesday? BTW, it is standard practice for banks, and most large firms, to schedule network outages from 12:01 am to 4:00 am on Sundays. That of course would not work for CCP, given the nature of their business, but there are times where impact would be minimized on their customer base.

I have been playing Eve for 6 months and am stunned at the lack of professionalism of their technical staff. The CIO should be fired.

Elaron
Jericho Fraction
The Star Fraction
Posted - 2008.10.30 03:56:00 - [111]
 

Originally by: Dinsdale Pirannha
<snipped a rant>


How about you wait for the post mortem before you get on your high horse? By past experience Valar & co are pretty good at explaining the causes of these unplanned outages.

CCP Mitnal


C C P
Posted - 2008.10.30 03:57:00 - [112]
 

Originally by: AdioHyperion
Originally by: CCP Valar
I'll post a little post-mortem here in a few.


Do what?


Valar will explain what happened :)

I believe it will state that the servers were not at fault for those knocking them.

Xaniff
Posted - 2008.10.30 03:58:00 - [113]
 

Originally by: Dinsdale Pirannha

In the case of the problem being traced back to something my company did, somebody would get fired. It is simply incomprehensible to me that ANY business would mess with their active database during regular business hours (that is the 23 hours/day in this case). And trust me, the only way CCP had a failure was if something WAS CHANGED.

You want to make changes that can't be done in the one hour you have allocated every day, you SCHEDULE A CHANGE WINDOW. You give everyone at least a week's heads up, and you bring down the server at a time which impacts the fewest people.

But these ad hoc outages are ridiculous, and would never happen if CCP actually followed some kind of discipline that every other company follows where its customers need to access information online.

I have been playing Eve for 6 months and am stunned at the lack of professionalism of their technical staff. The CIO should be fired.


I think the problem has been more of a hardware issue than software. And there was that problem with a local ISP yesterday which may have contributed.

Zach Forrester
Posted - 2008.10.30 03:59:00 - [114]
 

Originally by: CCP Mitnal
Originally by: AdioHyperion
Originally by: CCP Valar
I'll post a little post-mortem here in a few.


Do what?


Valar will explain what happened :)

I believe it will state that the servers were not at fault for those knocking them.


XD I'm gonna die, I know it. I honestly didn't mean it, whatever I did.

Ghost Goat
Posted - 2008.10.30 03:59:00 - [115]
 

Originally by: Dinsdale Pirannha
I have worked for a large company (think 3 letters) and we hosted many, many customers' web sites. Very rarely, we would have outages. All hell would break loose, even when we did not own the software/hardware/network.

In the case of the problem being traced back to something my company did, somebody would get fired. It is simply incomprehensible to me that ANY business would mess with their active database during regular business hours (that is the 23 hours/day in this case). And trust me, the only way CCP had a failure was if something WAS CHANGED.

You want to make changes that can't be done in the one hour you have allocated every day, you SCHEDULE A CHANGE WINDOW. You give everyone at least a week's heads up, and you bring down the server at a time which impacts the fewest people.

But these ad hoc outages are ridiculous, and would never happen if CCP actually followed some kind of discipline that every other company follows where its customers need to access information online.

Can you imagine what would happen if your bank decided to upgrade their database at 2:00 pm on a Wednesday? BTW, it is standard practice for banks, and most large firms, to schedule network outages from 12:01 am to 4:00 am on Sundays. That of course would not work for CCP, given the nature of their business, but there are times where impact would be minimized on their customer base.

I have been playing Eve for 6 months and am stunned at the lack of professionalism of their technical staff. The CIO should be fired.


well you can say that , or you can say
wooooooooooooooo omgomgogmg the server is up !11!11!1111
/me do the happy dance ,after realizing what i just did go hide in the corner in shame .

skill changed i can go to sleep a happy man now ...

but we still need the damn skill que .

ghost training is a goner , no excuses now .

CCP Valar

Posted - 2008.10.30 04:01:00 - [116]
 

With my outfit still smoking a bit, fresh from some firefighting, I bring you... THE POST MORTEM

The server crash tonight was a result of our attempts to prevent a server crash earlier today.
Around 21:20 this evening, we had an automatic alert go off, warning us that a RAMSAN was critically low on disk space. In an attempt to fix this, we shrank the data file on it and started an index defrag, to free up space in the datafile.... crisis averted... or so we thought.

At 2 AM, a full backup of the database started, but the index defragmentation of the biggest, most critical table in the database was still underway.
While a full backup of the database is being performed, the transaction log is not truncated on transaction log backups, and with the increased activity that comes with the index defrag, the transaction log quickly grew to fill up both RAMSANs...
This is what led to the server crash.

When I got a phone call from the on-call person, the server was already down.
I proceeded to do a fail-over of the database server, shrink the transaction log files and the datafiles on the RAMSANs.
When I had done this, I attempted to start the server, but it took 3 startup attempts before nodes started registering themselves in the database on time, likely due to the "warming up" the database has to do after a failover.

I'm truly sorry for the inconvenience this caused you and hope you can enjoy playing for the rest of the night.
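
The failure mode described in the post-mortem can be illustrated with a toy model. This is only a sketch under stated assumptions, not CCP's tooling: the names `LogVolume` and `minutes_until_full`, the 15-minute log backup interval, and all sizes and write rates are hypothetical. It assumes a log backup normally frees the whole log, and that no space is freed while a full backup is running, which is the behaviour the post-mortem describes.

```python
from dataclasses import dataclass


@dataclass
class LogVolume:
    """Hypothetical storage volume holding the transaction log."""
    capacity_gb: float   # space available for the transaction log
    used_gb: float = 0.0  # log space already in use


def minutes_until_full(volume, write_rate_gb_per_min, full_backup_running,
                       log_backup_interval_min=15, horizon_min=24 * 60):
    """Return the minute at which the log fills the volume,
    or None if it survives the whole horizon.

    Assumption: a log backup truncates the log completely, unless a
    full backup is in progress, in which case nothing is truncated.
    """
    used = volume.used_gb
    for minute in range(1, horizon_min + 1):
        # Heavy activity (e.g. an index defrag) writes log records.
        used += write_rate_gb_per_min
        if used >= volume.capacity_gb:
            return minute
        # Periodic log backup: frees space only when no full backup runs.
        if minute % log_backup_interval_min == 0 and not full_backup_running:
            used = 0.0
    return None


if __name__ == "__main__":
    vol = LogVolume(capacity_gb=100.0)
    print(minutes_until_full(vol, 2.0, full_backup_running=False))
    print(minutes_until_full(vol, 2.0, full_backup_running=True))
```

With the same write rate, the log stays bounded while periodic log backups can truncate it, but fills the volume once truncation is suspended. A pre-flight check along these lines (is heavy index maintenance still running, and is there enough headroom for the expected backup duration?) is one possible safeguard; whether it fits a given setup is a judgment call.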

Pr1ncess Alia
Posted - 2008.10.30 04:02:00 - [117]
 

Edited by: Pr1ncess Alia on 03/11/2008 06:57:48

Zach Forrester
Posted - 2008.10.30 04:03:00 - [118]
 

Originally by: CCP Valar
With my outfit still smoking a bit, fresh from some firefighting, I bring you... THE POST MORTEM

The server crash tonight was a result of our attempts to prevent a server crash earlier today.
Around 21:20 this evening, we had an automatic alert go off, warning us that a RAMSAN was critically low on disk space. In an attempt to fix this, we shrank the data file on it and started an index defrag, to free up space in the datafile.... crisis averted... or so we thought.

At 2 AM, a full backup of the database started, but the index defragmentation of the biggest, most critical table in the database was still underway.
While a full backup of the database is being performed, the transaction log is not truncated on transaction log backups, and with the increased activity that comes with the index defrag, the transaction log quickly grew to fill up both RAMSANs...
This is what led to the server crash.

When I got a phone call from the on-call person, the server was already down.
I proceeded to do a fail-over of the database server, shrink the transaction log files and the datafiles on the RAMSANs.
When I had done this, I attempted to start the server, but it took 3 startup attempts before nodes started registering themselves in the database on time, likely due to the "warming up" the database has to do after a failover.

I'm truly sorry for the inconvenience this caused you and hope you can enjoy playing for the rest of the night.


Okay, so it wasn't my fault. ^^ *walks away a free man* XD I swear I thought I was gonna be hunted down there.

Syberbolt8
Gallente
The Scope
Posted - 2008.10.30 04:05:00 - [119]
 

Originally by: Zach Forrester
Originally by: CCP Valar
With my outfit still smoking a bit, fresh from some firefighting, I bring you... THE POST MORTEM

The server crash tonight was a result of our attempts to prevent a server crash earlier today.
Around 21:20 this evening, we had an automatic alert go off, warning us that a RAMSAN was critically low on disk space. In an attempt to fix this, we shrank the data file on it and started an index defrag, to free up space in the datafile.... crisis averted... or so we thought.

At 2 AM, a full backup of the database started, but the index defragmentation of the biggest, most critical table in the database was still underway.
While a full backup of the database is being performed, the transaction log is not truncated on transaction log backups, and with the increased activity that comes with the index defrag, the transaction log quickly grew to fill up both RAMSANs...
This is what led to the server crash.

When I got a phone call from the on-call person, the server was already down.
I proceeded to do a fail-over of the database server, shrink the transaction log files and the datafiles on the RAMSANs.
When I had done this, I attempted to start the server, but it took 3 startup attempts before nodes started registering themselves in the database on time, likely due to the "warming up" the database has to do after a failover.

I'm truly sorry for the inconvenience this caused you and hope you can enjoy playing for the rest of the night.


Okay, so it wasn't my fault. ^^ *walks away a free man* XD I swear I thought I was gonna be hunted down there.


We can hunt you down anyway if you like... shouldn't be much of an issue. >:)

Stuart Bruegel
Gallente
Posted - 2008.10.30 04:09:00 - [120]
 

Originally by: CCP Valar
With my outfit still smoking a bit, fresh from some firefighting, I bring you... THE POST MORTEM

<snip>





Gotta love backups. I swear they break things far more often than you have to use them.

Thanks for the detailed post-mortem! Those of us in the business appreciate it.

Go get some sleep. :-)

