EVE Information Portal
New Dev Blog: The Long Lag
 
This thread is older than 90 days and has been locked due to inactivity.


 
Pages: 1 2 [3] 4 5 6 7


F90OEX
F9X
Posted - 2010.08.17 21:17:00 - [61]
 

Originally by: Frug
That blog was hot.

Makes me aroused.


Pity it's going to take you 18 months to climax :P

Rexthor Hammerfists
Caldari
Vanishing Point.
The Initiative.
Posted - 2010.08.17 21:23:00 - [62]
 

What caught my attention was the line that at some point there is a limit on the number of players on one node. It makes sense even to a halfwit like me, and even a halfwit like me has to wonder why you didn't take the chance, when implementing a whole new sov system, to spread attackers and defenders over neighbouring systems or even constellations instead of shoehorning everyone into one system.

How about this one: BCUs are to be anchored in the neighbouring system on the inbound gates of the attacked system. Throw some balance schmu at it, like a BCU can only be shot at if 1. the station/ihub is out of reinforcement and 2. all BCUs have been aggressed for at least a minute.

It's just a simple idea really. Personally I'd prefer a sov system based on occupancy instead of structures that need to be shot at and the dreaded reinforcement timer, but there are plenty of ways to share the load over several nodes with the current system, at least I hope so :)

Cailais
Amarr
Nasty Pope Holding Corp
Talocan United
Posted - 2010.08.17 21:28:00 - [63]
 

Originally by: CCP Warlock


To be clear, while we can advise game design on their scaling constraints, it's their responsibility to design the game. (Well, apart from providing them with the occasional - 'don't even think of doing that to the cluster' moment.)

The real challenge is always to design the system itself so that it will scale, and also to provide a good game experience. Ideally the constraints, there are always some, are essentially part of the game experience and accepted as such. So to answer your question, we certainly don't like the times when we have to actually enforce hard limits, even if realistically there's no immediate alternative.

For the short term we are concentrating on improving server and cluster performance, and fixing the long lag problem. Medium term will involve recruiting more cpu by going to multi-core servers. Longer term redesigning fleet fights to scale indefinitely by changing the game mechanics to allow the cluster to distribute load arbitrarily in them is what I would personally describe as an interesting problem. A very interesting problem, if the goal is also to still deliver a good experience.




I would think that closer collaboration with game design is fundamental. The issue of 'lag' in fleet battles is one which has close parallels with 'real world' military tactics. Command, Control and Communication nodes are often overwhelmed with incoming information, not all of which is critical, reliable or even relevant. Equally, distributing commands outwards to sub units suffers from a similar problem: the network is overloaded, leaving sub units paralysed and without clear direction.

Whilst this problem has largely been overcome in the modern age (certainly since the introduction of radio), prior to this, directions or orders to sub units were typically given in advance of a given operation, with constraints and freedom of action equally pre-planned. To a great extent that still happens today.

In terms of the 'bunching' of military force this is sometimes very desirable (often termed 'concentration of force') but that's not always the case and the ideology (if you can call it that) of manoeuvre warfare is now the norm - in essence allowing a numerically inferior force to defeat a numerically superior force without engaging in a battle of pure attrition.

I'd be interested to know what studies Game Design have done in this respect.

C.

Lykouleon
Wildly Inappropriate
Goonswarm Federation
Posted - 2010.08.17 21:34:00 - [64]
 

Came expecting a blog with shiny graphs and cool figures to look at.

Left VERY DISAPPOINTED!!!!!!!!

*ragequits*

Trebor Daehdoow
Gallente
Sane Industries Inc.
Posted - 2010.08.17 21:40:00 - [65]
 

Edited by: Trebor Daehdoow on 17/08/2010 21:41:14
Great blog, Jackie. As soon as you finished your informal presentation at the summit, I knew it would be a hit with the players.

If you could go into some more detail about the session-change issue, I think that would be much appreciated.

BTW, everyone should be sure to read the presentation that is attached to the blog. It goes into things in more detail, and provides important context for understanding this and other technical devblogs. And it is basically all shiny graphs.

CCP Warlock

Posted - 2010.08.17 21:40:00 - [66]
 

Originally by: Manaxus Stormwolf
hey there

I am a comp sci grad student at University of Washington CSS dept. Can you please send me a link to your doctoral paper you are referencing in the article? I'd like to read it.

Thanks!


For those of you who have been kind enough to ask - here is the link to my thesis.



Makko Gray
Pheno-Tech Industries
Crimson Wings.
Posted - 2010.08.17 22:18:00 - [67]
 

Edited by: Makko Gray on 17/08/2010 22:18:20
Originally by: CCP Oveur

You seriously do not want me to think about solving technical issues. Last time that happened unicorns died, dolphins left earth and war broke out in a galaxy far far away.


So somewhere between 1977 and 1985? *rolls eyes*

I would imagine database dependencies permeate far more than just cargoholds and hangars - how does EVE remember how much armour and shield you have left, for example? Surely things like that are updated with each hit, and with command ships and gang links affecting the totals I'd imagine there are a lot of calculations just to see if there are any still in fleet and what their cumulative effects are.

It's very easy as players to forget that almost every action or movement requires dozens, if not hundreds or even thousands, of calculations and that in most cases some of that information will need to be persisted somewhere. I also imagine that it can be easy for developers to forget that too, or at least difficult to remember or document all the individual instances, so good luck finding the haystack let alone the needle.

Hienz Doofenshmirtz
Posted - 2010.08.17 22:31:00 - [68]
 

1 grats on the work at mit, interesting read. had a prof in university working on getting a computer to recognize any object in a three dimensional space with a single two dimensional reference picture.

2 thank you for all the work you guys do all the time.

3 that dev blog blew the hair back on half the people who read it as it passed right over their heads, just the way eve related things should.

Hexxx
Minmatar
Posted - 2010.08.17 23:16:00 - [69]
 

Originally by: CCP Warlock
Originally by: Manaxus Stormwolf
hey there

I am a comp sci grad student at University of Washington CSS dept. Can you please send me a link to your doctoral paper you are referencing in the article? I'd like to read it.

Thanks!


For those of you who have been kind enough to ask - here is the link to my thesis.





Did CCP pull a "Google-bait-and-switch" to get you?? *shocked*

Respect! Your thesis turned my brains into a delicious strawberry-flavored smoothie drink. *embarrassed*

Hamish Nuwen
Gallente
Escuadron Federal de Asalto
Posted - 2010.08.17 23:41:00 - [70]
 

Originally by: Makko Gray
... and that in most cases some of that information will need to be persisted somewhere...

That is what the memory hierarchy is for: read the data once (from the DB) and cache it in RAM (and probably, at a lower level, in the L2 and L1 CPU caches).

If a change of session implies a change of node, and that requires rereading from the DB all the data related to ships, chars and fleets, then there you could have one of the sources of the CPU spike in stargate jump lag, for example. The fewer DB transactions, the better the performance in that scenario.
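
A minimal sketch of the read-once-then-cache idea above. Everything here is an assumption for illustration: `load_ship_from_db` is a hypothetical stand-in for a database query, and none of the names come from EVE's actual code.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Keeps rows in RAM so repeated reads do not hit the database."""

    def __init__(self, loader: Callable[[int], dict]) -> None:
        self._loader = loader            # hypothetical DB read, e.g. a SELECT by id
        self._rows: Dict[int, dict] = {}
        self.db_reads = 0                # how often we fell through to the DB

    def get(self, key: int) -> dict:
        if key not in self._rows:
            self.db_reads += 1
            self._rows[key] = self._loader(key)
        return self._rows[key]


def load_ship_from_db(ship_id: int) -> dict:
    # Stand-in for a real query; the fields are invented for illustration.
    return {"ship_id": ship_id, "shield": 100, "armor": 100}


cache = ReadThroughCache(load_ship_from_db)
for _ in range(1000):
    cache.get(42)        # only the first call reaches the "database"
```

In this sketch a node that has never seen the pilot starts with an empty cache, which is the reread cost on a session change described above.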

Bomberlocks
Minmatar
CTRL-Q
Posted - 2010.08.17 23:49:00 - [71]
 

Originally by: Trebor Daehdoow
Edited by: Trebor Daehdoow on 17/08/2010 21:41:14
Great blog, Jackie. As soon as you finished your informal presentation at the summit, I knew it would be a hit with the players.

If you could go into some more detail about the session-change issue, I think that would be much appreciated.

BTW, everyone should be sure to read the presentation that is attached to the blog. It goes into things in more detail, and provides important context for understanding this and other technical devblogs. And it is basically all shiny graphs.
The ironic thing, as far as I see it, is that although the eve server might be distributed, the client server relationship is still absolutely hierarchical. I honestly wonder if offloading some of the computation to clients wouldn't be a good thing.

CCP Atropos

Posted - 2010.08.17 23:53:00 - [72]
 

Originally by: Bomberlocks
Originally by: Trebor Daehdoow
Edited by: Trebor Daehdoow on 17/08/2010 21:41:14
Great blog, Jackie. As soon as you finished your informal presentation at the summit, I knew it would be a hit with the players.

If you could go into some more detail about the session-change issue, I think that would be much appreciated.

BTW, everyone should be sure to read the presentation that is attached to the blog. It goes into things in more detail, and provides important context for understanding this and other technical devblogs. And it is basically all shiny graphs.
The ironic thing, as far as I see it, is that although the eve server might be distributed, the client server relationship is still absolutely hierarchical. I honestly wonder if offloading some of the computation to clients wouldn't be a good thing.

But then you end up with the client being authoritative which is generally a bad idea.

Master Akira
Shiva
Morsus Mihi
Posted - 2010.08.17 23:57:00 - [73]
 

Originally by: CCP Atropos
Originally by: Bomberlocks
Originally by: Trebor Daehdoow
Edited by: Trebor Daehdoow on 17/08/2010 21:41:14
Great blog, Jackie. As soon as you finished your informal presentation at the summit, I knew it would be a hit with the players.

If you could go into some more detail about the session-change issue, I think that would be much appreciated.

BTW, everyone should be sure to read the presentation that is attached to the blog. It goes into things in more detail, and provides important context for understanding this and other technical devblogs. And it is basically all shiny graphs.
The ironic thing, as far as I see it, is that although the eve server might be distributed, the client server relationship is still absolutely hierarchical. I honestly wonder if offloading some of the computation to clients wouldn't be a good thing.

But then you end up with the client being authoritative which is generally a bad idea.


"Generally" meaning "all the cases where the player can gain an unfair advantage" by hacking the client, right?

So, are there ANY kinds of load the client could take from the server that don't fall into this category?

Luke S
Zeta Corp.
Posted - 2010.08.18 00:02:00 - [74]
 

Quote:
A number of developers have been working on this problem. A number of possible causes have so far been identified, tested, and turned out not to be the direct cause of this particular issue.


Wait a sec. You mean that some issues are being caused by another unknown factor? So it's a domino effect?

CCP Warlock

Posted - 2010.08.18 00:04:00 - [75]
 

Originally by: Bomberlocks
Originally by: CCP Warlock
...I had a pet theory that it was a TCP rate adaption issue, in conjunction with a system lock affecting multiple clients. No such luck. We narrowed it down a little after I accidentally left myself logged on while stuck overnight, and came in the next morning and found I was successfully jumped into the system...
Firstly, I'd like to thank you for a forthright, highly detailed blog that exhibits none of the defensiveness we've come to expect, and one that comes straight to the point without trying to complicate the issue by making it too bland for us armchair software engineering warriors to read.

Then, I'd like to say that (from my utterly clueless perspective) a networking transport layer problem was also high on my list of probable contributing circumstances. You mention TCP rate adaption. I had a certain hunch that the statefulness of the TCP layer could be causing a feedback loop due to lost packets forcing retransmits and thereby slowing the network down under high load. This could appreciably happen at any point where TCP is being used, or between any points, or combinations of them, such as external networking equipment, the database network handling code (you've had problems with this before if I remember correctly; the starvation issue) and the various software processes that use networking code themselves.

If I'm not mistaken, the realtime game "time" is done in ticks, and the various processes that handle system, ship, fleet, grid etc. are all stateful. If the ticks are stateless, i.e. they drive the whole game as a master time system, or even only a node, could it not be that the events or changes of state that are supposed to be triggered are not receiving the information they need in time (especially if time in the game process is handled statefully), leading to corrupt state in various objects?

Please forgive me again for my amateurish attempt to make a mental picture of what is happening, but certain correlations of the "lag", like the empty system lag backwash, or the empty system gate issues make me wonder if it isn't a combination of low level high load effects causing or contributing to software process degradation?

This was what caused me to ask in my post in CCP Tanis' blog how error or failure tolerant the code is, or how errors were handled in stateful objects.

I would be most grateful if you could take a minute or two to correct me in my idiocy.


I'm afraid if you want to protest idiocy, you're going to have to stop asking such good questions :)

I guess the short answer is: clearly not fault tolerant enough at the moment. Although to be fair, the system in general is remarkably fault tolerant. We do have ticks on the servers, but they aren't hard, real-time ticks. That approach can work well for some constrained problems, but system-wide synchronization introduces its own scaling issues for distributed systems, especially over WANs, not to mention a really nasty impossibility proof.

TCP is the transport mechanism for all cluster communication; we then put our own message handling system above that, and that in turn is encapsulated in RPC semantics for the game developers. One fairly obvious explanation for the long lag issue is that it's a deadlock problem, and a client getting rate adapted would explain how it could get triggered in a couple of situations, peculiar to fleet fights, that we could think of.

Upshot was, a couple of our long suffering QA engineers had some fun with Netem, to no good result unfortunately. It was such a nice theory too, *sigh*
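
For readers unfamiliar with that kind of layering, here is a rough sketch of a message layer on top of a byte stream with an RPC-style call on top of that. It is only an illustration under stated assumptions: the length-prefixed JSON framing, the `HANDLERS` table and the `add` method are all invented here and say nothing about how the real cluster code works.

```python
import json
import struct
from typing import Callable, Dict, List, Tuple


def frame(message: dict) -> bytes:
    """Message layer: prefix each message with its length, since TCP is just a byte stream."""
    body = json.dumps(message).encode("utf-8")
    return struct.pack("!I", len(body)) + body


def unframe(stream: bytes) -> Tuple[List[dict], bytes]:
    """Split received bytes into complete messages plus any leftover partial message."""
    messages: List[dict] = []
    while len(stream) >= 4:
        (size,) = struct.unpack("!I", stream[:4])
        if len(stream) < 4 + size:
            break                      # rest of this message has not arrived yet
        messages.append(json.loads(stream[4:4 + size].decode("utf-8")))
        stream = stream[4 + size:]
    return messages, stream


# RPC layer: a hypothetical table mapping method names to server-side functions.
HANDLERS: Dict[str, Callable[..., object]] = {"add": lambda a, b: a + b}


def rpc_dispatch(message: dict) -> object:
    return HANDLERS[message["method"]](*message["args"])


# One complete call followed by the first 5 bytes of a second one.
wire = frame({"method": "add", "args": [2, 3]}) + frame({"method": "add", "args": [1, 1]})[:5]
messages, rest = unframe(wire)
result = rpc_dispatch(messages[0])
```

The leftover bytes in `rest` stay buffered until more data arrives, so a caller waiting on the second reply is stalled for as long as the transport is slow.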

Taudia
Gallente
Sane Industries Inc.
Initiative Mercenaries
Posted - 2010.08.18 00:05:00 - [76]
 

Originally by: Master Akira

So, are there ANY kinds of load the client could take from the server that don't fall into this category?


There is one mentioned in the presentation. Both the clients and the server calculate the physics. Since the server uses its own locally calculated values there is no risk of tampering, while it still avoids the overhead of sending the results to the clients. This is, as far as I can tell, also what allows for desynching: if a calculation is not deterministic (i.e. the same inputs do not always give the same outcome), the client and the server may end up working from two different scenarios.
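
A toy sketch of that idea, assuming a made-up one-dimensional "physics" step with integer state; the function names and the jitter variant are invented for illustration and are not taken from the presentation.

```python
import random
from typing import List, Tuple

State = Tuple[int, int]  # (position, velocity) in integer units


def step(state: State, thrust: int) -> State:
    """Deterministic update: the same state and input always give the same result."""
    pos, vel = state
    vel = vel + thrust
    return (pos + vel, vel)


def simulate(inputs: List[int]) -> State:
    state: State = (0, 0)
    for thrust in inputs:
        state = step(state, thrust)
    return state


def simulate_with_jitter(inputs: List[int], rng: random.Random) -> State:
    """Non-deterministic variant: each side mixes in its own local noise."""
    state: State = (0, 0)
    for thrust in inputs:
        state = step(state, thrust + rng.choice([0, 1]))
    return state


inputs = [1, 0, -1, 2, 2]
server = simulate(inputs)
client = simulate(inputs)      # only the inputs were shared, never the results
in_sync = server == client
```

With `simulate`, both sides reach the same state from the same inputs. With `simulate_with_jitter`, two sides using differently seeded generators can drift apart, which is the desync case.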

Trebor Daehdoow
Gallente
Sane Industries Inc.
Posted - 2010.08.18 00:17:00 - [77]
 

Edited by: Trebor Daehdoow on 18/08/2010 00:17:02
Originally by: CCP Atropos
But then you end up with the client being authoritative which is generally a bad idea.

This can still be a win if the computational cost for the server to verify that the client isn't cheating is less than the cost for the server to do the full computation, or if the client is used as a cache for information that the server creates.

It's the MMO version of Ronald Reagan's famous maxim: "Trust, but Verify".

A hypothetical example of the latter case would be bookmarks -- created and signed on the server, they could be stored on the client, exported, emailed, imported, whatever. The server doesn't need to know anything about them until they are used, and then only needs to verify the signature to determine that it is an authentic bookmark.

Just pick a key length that's long enough to be secure -- otherwise someone will hire a botnet to crack the key! *evil grin*
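
A minimal sketch of the bookmark example, with one substitution named plainly: it uses an HMAC (a keyed hash with a secret held only by the server) rather than a public-key signature, which is enough when the server is the only party that ever verifies. The key, field names and bookmark contents are all made up for illustration.

```python
import hashlib
import hmac
import json

# Assumption: this secret exists only on the server and never reaches a client.
SERVER_KEY = b"example-only-secret-kept-on-the-server"


def sign_bookmark(bookmark: dict) -> dict:
    """Server side: return the bookmark plus a tag the client can store or share."""
    payload = json.dumps(bookmark, sort_keys=True).encode("utf-8")
    tag = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return {"bookmark": bookmark, "sig": tag}


def verify_bookmark(signed: dict) -> bool:
    """Server side: recompute the tag when the bookmark is actually used."""
    payload = json.dumps(signed["bookmark"], sort_keys=True).encode("utf-8")
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["sig"])


signed = sign_bookmark({"system": "Jita", "x": 10, "y": -4, "z": 7})
# A client that edits the coordinates but keeps the old tag:
forged = {"bookmark": {**signed["bookmark"], "x": 9999}, "sig": signed["sig"]}
```

Here `verify_bookmark(signed)` succeeds and `verify_bookmark(forged)` fails, so the server stores nothing per bookmark and pays one hash computation at the moment of use.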

Alekseyev Karrde
Noir.
Noir. Mercenary Group
Posted - 2010.08.18 00:24:00 - [78]
 

Edited by: Alekseyev Karrde on 18/08/2010 00:24:58
Originally by: CCP Greyscale
Originally by: Dragon Greg
[...]

But ultimately you have to consider humans being mostly sheep and following combinatory paths of least resistances and social security mechanisms in dealing with objectives game design could come up with. Think of it like the perpetual race between the bullet and the armour.

I'm really curious what the perspective of game design is on this, or rather, of the people responsible for overall product development. This is after all something that can touch on game design and product development format, if not principles. And it is an angle on conflict management which touches on quite a lot more than "just pvp"; at minimum, by analogy of principle, it has the potential to affect all forms of conflict in the simulation.



This is something we think about a lot. I've talked about my thoughts on this stuff with a previous CSM (3 or 4 I think?), and there's some of it in this blog.


Speaking of that blog... Seeing these tech blogs is a very encouraging and needed step. Lag is without a doubt the most pressing issue right now. But it's not the only one; the issues of short-of-vision features (see, I can be nice), simple but important rebalances, and actual content improvement have not gone away. It would be warming, once this series of tech blogs outlining anti-lag efforts is completed, to see a similar, though perhaps not quite as exhaustive, series for the other issues which are very much on our minds.

I could work with the current CSM to devise a shortlist if it would be helpful, but a fair number of them are obvious (rockets, af, outpost hp, docking games, etc) and have been beaten to death at this point. So by all means pleasantly surprise us.

EDIT: add the forums to that list, they ate my OP. Good thing Chrome saves field text

Kamyyn Foiritain
Posted - 2010.08.18 00:29:00 - [79]
 

Originally by: CCP Warlock
Originally by: Manaxus Stormwolf
hey there

I am a comp sci grad student at University of Washington CSS dept. Can you please send me a link to your doctoral paper you are referencing in the article? I'd like to read it.

Thanks!


For those of you who have been kind enough to ask - here is the link to my thesis.





So I don't really know nuthin' about no computer programing but I scanned through your thesis anyways.

From what I can tell you're building Skynet?

TorTorden
Amarr
Posted - 2010.08.18 00:55:00 - [80]
 

So here is one thing to consider, bearing in mind that I have only glanced at the first chapter of the thesis.
How about (in the future, of course) having the pilots' computers on grid speak to each other, rather than having all of them first feed information to the server, which we can for all intents and purposes say is overloaded, and then sit around waiting for the server to respond?

Of course many more questions arise, like how to decide whether Alfred gets to lock on Betty and fire off his row of 1400mm's, or how to balance the load so that Harry's P4 computer on 2 Mbit ADSL does less shared work for the cloud than Betty's i7 on a synchronous 4 MB/s FIOS line.

Probably lots more as well.

PS: I don't get why everybody automatically assumes an AI would be the mechanical version of Son of Sam, or of mass-murdering psychos in general.
And Skynet itself wouldn't have been much more than a big trojan if they hadn't given it the launch codes.

CCP Warlock

Posted - 2010.08.18 01:00:00 - [81]
 

Originally by: Kamyyn Foiritain


So I don't really know nuthin' about no computer programing but I scanned through your thesis anyways.

From what I can tell you're building Skynet?


Truth to tell, I was actually thinking of space robotics when I proposed it. Build cheap and most importantly light little robots that could then co-operate together to do more complicated things. Get something like that up there, and we might start making some progress again.

But now that you mention Skynet, and especially in its Terminator 3 incarnation, there is a small element of that too.

I'd be the first to admit that it's a somewhat weird approach, and probably only practical for some fairly particular problem domains, but self-organisation should be one of them.

Xessej
Posted - 2010.08.18 01:41:00 - [82]
 

Originally by: CCP Warlock
No, sorry I can't. I don't like to make commitments of any kind until we are absolutely sure we can nail it to the wall, and there are a lot of uncertainties in this one - as well as the need, as earlier posters pointed out, of very carefully and thoroughly testing it out.

There is a lot of motivation to do this though, the very least of which is there are some very cool server toys, I mean advanced high performance multi-core systems it will let us play with.

Let me strongly suggest that, rather than redesigning the architecture to allow a solar system to run on multiple cores, you instead approach the problem by making each grid a distinct process that can be moved between cores as needed. This approach allows dynamic load balancing.

The trick would be to dynamically copy the running process in real time to the new core without desyncing the clients.
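
A rough sketch of the balancing half of that suggestion, assuming each grid's load can be summarised as a single number. The grid names and loads are invented, and the greedy "heaviest grid onto the least-loaded core" rule is just one simple heuristic, not anything CCP has described.

```python
import heapq
from typing import Dict, List


def assign_grids(grid_load: Dict[str, int], cores: int) -> List[List[str]]:
    """Greedy balancing: heaviest grid first, always onto the least-loaded core."""
    heap = [(0, core) for core in range(cores)]      # (total load, core index)
    heapq.heapify(heap)
    placement: List[List[str]] = [[] for _ in range(cores)]
    for grid, load in sorted(grid_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, core = heapq.heappop(heap)
        placement[core].append(grid)
        heapq.heappush(heap, (total + load, core))
    return placement


# Hypothetical loads: number of ships on each grid in one solar system.
loads = {"gate-A": 600, "station": 150, "belt-3": 20, "pos-moon-7": 400, "gate-B": 30}
placement = assign_grids(loads, cores=2)
```

With these numbers the busy gate gets a core to itself and the other four grids share the second one. The sketch says nothing about the hard part named above, which is moving a running grid between cores without desyncing the clients, and it does not help when a single grid alone exceeds one core.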

Zendoren
Aktaeon Industries
The Black Armada
Posted - 2010.08.18 02:36:00 - [83]
 

Edited by: Zendoren on 18/08/2010 03:08:44
Originally by: CCP Oveur
Within "HPC" we've done quite a few things but Infiniband specifically is still being considered. It wouldn't give any huge benefits today just by switching to Infiniband (none even I think), it's more that it opens up new possibilities between physical machines with the associated software changes.

But the first step, and greatest future potential, is to get things to fully capitalize on multi-processor and multi-core. Like spanning a solar system over multiple cores.


THIS!!!!!!!!!!!! to the Nth degree...

So, this begs the question! Have the EVE Online Core infrastructure gurus started prepping the code for this????

Edit:

I looked over your presentation, and judging from the content it's obvious that you are looking at the issue of lag as a fundamental limitation of the client-server topology.

While reading over the slides an idea to use a combination of client-server and Lancasts came to mind. For example, groups of players are bundled together in a small, fully connected mesh that is dictated by the latency between them. Have the lowest-latency client in the group (with respect to the EVE server) handle the communication for the entire Lancast group. This will allow you to bundle and compress data on the higher layers of the OSI model.

Don't know if this idea was helpful but I thought I should throw it out there.
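
A small sketch of the relay part of that idea, assuming the group has already been formed and each member's latency to the server is known. The client names, latencies and message fields are invented, and JSON plus zlib is just a convenient stand-in for whatever bundling and compression would really be used.

```python
import json
import zlib
from typing import Dict, List


def pick_relay(server_latency_ms: Dict[str, int]) -> str:
    """Choose the group member with the lowest latency to the server (ties: by name)."""
    return min(sorted(server_latency_ms), key=server_latency_ms.get)


def bundle(messages: List[dict]) -> bytes:
    """Relay side: pack the group's messages into one compressed blob."""
    return zlib.compress(json.dumps(messages).encode("utf-8"))


def unbundle(blob: bytes) -> List[dict]:
    """Server side: recover the individual messages."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


group = {"alice": 120, "bob": 45, "carol": 30}      # hypothetical latencies in ms
relay = pick_relay(group)
outgoing = [{"from": name, "cmd": "lock", "target": 9001} for name in sorted(group)]
blob = bundle(outgoing)
```

Every message in the bundle passes through one player's machine, so the server would still have to treat the contents as untrusted.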

Sooche Mo'Freed
Posted - 2010.08.18 03:14:00 - [84]
 

CCP you are going to find out that the client is causing the lag.

onyu
Posted - 2010.08.18 04:48:00 - [85]
 

Thanks for the fabulous blog... and the interesting discussion thread following it!


I'd like to point out that there are client optimizations possible without making the client 'authoritative', just by fixing errors in scripts and order of execution.

Set up a few devs with an EVE client on netbooks over a really crummy connection. (Give them internet lag, not server lag, and a computer that can run EVE, but just barely.)

This will make it easy to see a whole slew of places where the EVE client is hitting the server without need and other procedural problems.

This isn't your big quantum leap fix, which we all hope you pull off (The military might license your technology for more money than eve will ever earn)


Reducing needless client hits on the server would seem a worthwhile incremental improvement, not to mention causing players less frustration with a client that stops responding all the time waiting for non critical server requests.

You could even start denying non-combat database requests when a node reaches a certain load. It seems perfectly fine if the market and contract systems simply close shop when a fleet battle is raging in a system.

I think devs are aware of players creating contracts or checking market price after market price in a fleet fight system, just to bring the node down etc. so...
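
A minimal sketch of that kind of load shedding. The request kinds, the 0.85 threshold and the idea of a single "node load" number are all assumptions made for illustration.

```python
from enum import Enum


class Kind(Enum):
    COMBAT = "combat"
    MARKET = "market"
    CONTRACT = "contract"


SHED_THRESHOLD = 0.85                       # assumption: fraction of node CPU in use
NON_CRITICAL = {Kind.MARKET, Kind.CONTRACT}


def admit(kind: Kind, node_load: float) -> bool:
    """Return True if the request should be processed at the current load."""
    if node_load >= SHED_THRESHOLD and kind in NON_CRITICAL:
        return False                        # market and contracts "close shop"
    return True
```

For example, `admit(Kind.MARKET, 0.9)` is refused while `admit(Kind.COMBAT, 0.99)` still goes through, so a busy node spends nothing on price checks during a fight.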


Estimated Prophet
Ye Olde Curiosity Shoppe and Trading Company
EVE Trade Consortium
Posted - 2010.08.18 05:29:00 - [86]
 

Originally by: Trebor Daehdoow
Edited by: Trebor Daehdoow on 18/08/2010 00:17:02
Originally by: CCP Atropos
But then you end up with the client being authoritative which is generally a bad idea.

This can still be a win if the computational cost for the server to verify that the client isn't cheating is less than the cost for the server to do the full computation, or if the client is used as a cache for information that the server creates.

It's the MMO version of Ronald Reagan's famous maxim: "Trust, but Verify".



If you like sci fi I recommend Halting State by Charles Stross. In it his (fictional) distributed MMO platform would farm out a computation to 3+ clients, and if they all returned the same result it was accepted. So you could be in Jita, playing with the market, while your client processes random calculations for someone ratting in Delve, a fleet fight in Scalding Pass and someone missioning in Motsu. Of course you then have network latency to contend with - I'm in Australia and my ping is 330 on a good day; very different to packets passing between computers in the same rack in a data-center.
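
A toy sketch of that accept-if-all-agree scheme, assuming clients are modelled as plain functions. The names and the squaring "computation" are invented, and nothing here reflects how the novel or any real system implements it.

```python
from typing import Callable, Optional, Sequence

Client = Callable[[int], int]


def accept_if_unanimous(task: int, clients: Sequence[Client]) -> Optional[int]:
    """Farm one task out to 3+ clients; accept the result only if all of them agree."""
    if len(clients) < 3:
        raise ValueError("need at least three clients")
    results = [client(task) for client in clients]
    first = results[0]
    return first if all(r == first for r in results) else None


def honest(n: int) -> int:
    return n * n


def cheater(n: int) -> int:
    return n * n + 1        # a tampered client reporting a wrong answer


agreed = accept_if_unanimous(7, [honest, honest, honest])
rejected = accept_if_unanimous(7, [honest, cheater, honest])
```

A rejected task has to be redone elsewhere, and every task costs at least three round trips to player machines, which is where the latency concern above comes in.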

AeonOfTime
Minmatar
Syrkos Technologies
Joint Venture Conglomerate
Posted - 2010.08.18 06:06:00 - [87]
 

The solution is right under your noses, and CCP Warlock even mentioned it in his excellent post:

Quote:
There can also be Schrodinger effects, where examining the state of the system changes its behaviour enough that the issue doesn't manifest itself.


:D

Vaal Erit
Science and Trade Institute
Posted - 2010.08.18 07:13:00 - [88]
 

I was kinda hoping for a dev blog like:

Okay we found the cause of lag.......it's the users. &#@^ off.

Nareg Maxence
Gallente
Posted - 2010.08.18 07:28:00 - [89]
 

Originally by: CCP Oveur
Originally by: Mynxee
Thrilled to see this dev blog, thrilled to check it off as Delivered on the CSM's "CCP Deliverables List" from the Summit, and now actually gonna get a nice cup of coffee and read it. Thanks, Warlock.



Why isn't that list in the Evelopedia?

Why isn't there a link that says Evelopedia with big friendly letters on the main EVE website?

(Not counting the Item Database link.)

Lord Zim
Goonswarm Federation
Posted - 2010.08.18 07:55:00 - [90]
 

Getting the server code to span multiple cores is probably going to lift the limits by which we're ... erm, limited by. However, I'm not so sure that's what should be the main focus to start with, although once the cause for long session changes is found then getting the code multicored should probably be next on the list to fix module lag.

Having said the obvious, I have a few questions.

1) Does SiSi have multiple nodes? I've been able to jump into systems with 500+ easily, and load grid within a minute, and I've jumped into systems with 50 (not fighting) and not loaded at all. I've also been kept in the old system for 20-30 minutes while jumping out, watching everyone else at the gate (hostile and friendly) warp around and shoot each other just fine, only to load almost instantly once I'm "out of" the old system. So my theory is that the problem may be hard to reproduce on SiSi because you're jumping from one system to the next within the same physical node, and thus not triggering the specific issue that was introduced (or worsened, I really don't care about the distinction) in Dominion. I assume this has been thought of, but I want to throw it in there just in case it hasn't, because I remember seeing it mentioned somewhere that SiSi and TQ aren't quite as equal as might be desired for testing complex issues like the ones we're looking at now.

2) I haven't kept up 100% on what has been fixed, but the NC vs white noise fight in 6NJ (I think it was) where the titans were rubberbanded back into the system, has that problem been isolated and fixed?




 

