Francisco Matias Cuenca-Acuna and Thu D. Nguyen
Department of Computer Science, Rutgers University
110 Frelinghuysen Rd, Piscataway, NJ 08854
Appears in Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC), August, 2001.
We consider the use of cooperative caching to manage the memories of cluster-based servers. Over the last several years, a number of researchers have proposed content-aware servers that implement locality-conscious request distribution to address this memory management problem [2,18,4,5,8]. During this development, it has become conventional wisdom that cooperative caching cannot match the performance of these servers . Unfortunately, while content-aware servers provide very high performance, their request distribution algorithms are typically bound to specific applications. The advantage of building distributed servers on top of a block-based cooperative caching layer is the generality of such a layer; it can be used as a building block for diverse services, ranging from file systems to web servers.
In this paper, we reexamine the question of whether a server built on top of a generic block-based cooperative caching algorithm can perform competitively with content-aware servers. Specifically, we compare the performance of a cooperative caching-based web server against L2S, a highly optimized locality- and load-conscious server. Our results show that by modifying the replacement policy of traditional cooperative caching algorithms, we can achieve much of the performance provided by locality-conscious servers. Our modification increases network communication to reduce disk accesses, a reasonable trade-off considering the current trend of relative performance between LANs and disks.
We consider the use of cooperative caching to manage the memories of cluster-based servers. Over the last several years, the number of Internet users has increased rapidly, necessitating the construction of giant-scale Internet servers . To achieve the necessary scale and performance, service providers have no choice but to use large multiprocessor systems or clusters. While clusters promise better scalability, availability, and performance vs. cost , their distributed memory architecture often makes building scalable services more difficult. In particular, if the memories of individual nodes are used as independent caches of disk content, servers perform well only when their working sets fit into the memory of a single node, limiting system scalability [18,5].
To address this problem, a number of researchers have proposed content-aware servers that implement locality- and load-conscious request distribution [2,18,4,5,8]; that is, these servers use information about the content being requested and the load at each node in the cluster to choose which node should serve a particular request. This allows the server to explicitly manage node memories as an aggregate whole rather than independent caches. In this paper, we propose a different approach: the use of a generic (but potentially configurable) cooperative caching middleware layer to manage the memories of cluster-based servers.
The advantage of using a generic middleware layer (or library) lies in its generality: it should be usable as a building block for diverse distributed services, reducing the effort necessary to design and implement cluster-based servers. On the other hand, the disadvantage is that its generality may hurt performance. For example, our middleware layer implements a block-based cooperative caching protocol to maximize generality. Handling blocks may be inefficient, however, for servers that always use entire files such as web servers.
While we believe that it is worthwhile to trade some performance for ease of design and implementation, the question remains: how much (if any) performance would we have to sacrifice? To explore this question, we compare the performance of a server that uses cooperative caching to that of a content-aware server. In particular, we simulate a web server built on top of a block-based cooperative caching layer to L2S, a highly optimized distributed server that implements both locality- and load-conscious request distribution .
Our specific contributions include: (1) working out the details necessary to adapt previous client-based cooperative caching algorithms to derive an algorithm for a distributed cluster-based server, (2) modifying the traditional LRU-based global replacement algorithm to improve the performance of cooperative caching in the context of cluster-based servers, and (3) comparing the performance of cooperative caching with that of a content-aware server for serving static web content.
Our simulation results for a set of 4 web traces show that servers employing traditional cooperative caching algorithms will likely perform significantly worse than locality-conscious servers. However, a small modification to the replacement algorithm to keep the last copy of a block in memory whenever possible leads to dramatic increase in throughput. In fact, our results show that a web server employing our modified cooperative caching algorithm can achieve over 80% of L2S's throughput in almost all cases and over 92% of L2S's throughput in most cases. Considered in the light of the fact that currently, the simulation of L2S includes two implementation advantages that may account for this performance difference, including the use of TCP hand-off  and the assumption that the entire web content is replicated on the disk of each node in the cluster, our results show that cooperative caching has the potential to fully match the performance of locality- and load-conscious request distribution. These results strongly support further exploration of the use of a middleware cooperative caching layer as a building block to ease the task of designing and implementing scalable cluster-based services.
The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 describes our cooperative caching layer in more detail. Section 4 describes our simulation system. Section 5 presents simulation results and discusses our modification to traditional cooperative caching algorithms. Section 6 concludes the paper and discusses future work. Related Work
Cooperative caching has been used to improve client access latency and reduce server load for some time [14,11,19]. The basic algorithm of our cooperative caching layer derives from this body of work. Our work differs from these efforts in that we concentrate on the aggregate performance of a distributed server whereas they mostly concentrate on improving the performance of individual client nodes and reducing server load. Further, system parameters for a network of clients differ significantly from those of a cluster-based server. These differences lead to different trade-offs in the caching algorithm.
Our work is also related to efforts to design and build cooperative caches for distributed file systems [6,1,10]. However, these efforts did not consider whether cooperative caching is competitive with content-aware request distribution. Further, we are interested in building a cooperative caching middleware layer that can be used by diverse distributed services, not just file systems.
In the context of Gigantic Internet servers , Fox et al. suggest using dedicated nodes for caching data in order to help I/O-bound or CPU-bound nodes . In their work, they implement such a middleware layer and provide an API for programmers to use their caching services. Our work could be used in a similar manner. However, we are more interested in a layer that could be used as a library module as well as an independent middleware service of its own. Overview of the Cooperative Caching Middleware (CCM) Layer
CCM is a block-based cooperative caching layer; it accepts requests for specific file blocks and replies with the appropriate data. Block requests can arrive at any node in the cluster but each file is stored on only one disk in the cluster. The node where a file is stored on disk is called its home. Currently, we assume a read-only request stream (and so have not implemented a write protocol) since we are studying CCM in the context of a web server serving only static content.
There are two types of blocks in CCM: master blocks and non-master blocks. When a file block is first read from disk to be cached in memory, it is designated a master copy. Copies of a master copy are called non-master copies. When a request for a block b in file f arrives at a node n, if n has a copy of b in its memory, then it services the request right away. Otherwise, n uses a system of hints to locate the master copy of b, bmc. If bmc is currently in the memory of some node m, then n requests a non-master copy of b from m. On receiving m's reply, n keeps the copy of b in its memory and services the request. If n cannot locate bmc in cluster memory, then n requests a copy of b from f's home node. If the home node is able to locate bmc in cluster memory, then it forwards a request to the caching node on n's behalf. If it also cannot locate bmc, then it reads b into memory and sends it to n. This copy is now designated the master copy of b in memory. The home does not keep a copy of b; only n does.
In its most basic form, CCM employs an approximate global LRU replacement scheme. As shall be seen in Section 5, this replacement algorithm must be modified to achieve good performance. Each node tracks the age of the oldest block of each of its peers. When a node brings a block to its memory to service some request, if its memory is full, it evicts its oldest block. If its oldest block is a non-master copy or is the oldest block in the system, then it simply drops the block. If, however, the node's oldest block is a master block and one of its peer has an older block, then it forwards the evicted master block to the peer with the oldest block. When a node receives a forwarded block, it must drop its oldest block to make place for the forwarded block. Two important properties should be noted: (1) blocks forwarded to peers do not cause cascaded evictions, and (2) when a forwarded block arrives at its destination, all blocks at the destination may now be younger than the forwarded block; in this case, the forwarded block is dropped.
Given the above description of CCM's block-based cooperative caching algorithm, two questions remain: (1) how does a node n locate some block b that is not resident in n's memory but may be in the global cache somewhere? and (2) how does each node track the ages of the oldest blocks at its peers? CCM uses a hint system similar to that described in  to address both of these issues.
CCM implements two types of hints: location hints and age hints. Whenever the home node of a file f reads a master copy of a block bmc and forwards it to a node n, it inserts a location hint saying that n is caching bmc into a local pool of such hints. Whenever a node m needs to service a request for block b, it first looks in its local cache. If b is not there, it looks for a location hint for b in its local pool of hints. If there's a hint saying that bmc is resident at node n, then m would request a non-master copy of b from n. If m does not have a location hint for b, it sends a request for b to f's home fhome. When the request arrives at fhome, if fhome contains a location hint for bmc, then fhome would forward the request to the caching node, say n. If n still has bmc in its memory, then it sends a non-master copy to m. When m receives b, it puts a hint saying that bmc is at n into its local pool of hint. If n has forwarded bmc to a peer p, it would have kept a hint saying that bmc is now at p. It uses this hint to pass on m's request to p. If n previously evicted bmc without forwarding it anywhere, n would now notify fhome that it no longer has bmc, at which point fhome would read bmc from disk, forwards it to m, and update its hint to say that m now has bmc. Figure 1 illustrate this process for one example sequence of requests and evictions. Note that hints are updated lazily: a node caching bmc does not tell fhome if it decides to evict or forward bmc until it receives a request for b. In fact, the home is never notified of the forwarding of master copies; each request simply follows a chain of hints to the current node caching the master copy of a block.
Nodes use age hints to track the age of the oldest block at each of its peers. Age hints are exchanged as follows: when a node forwards a master block to a peer because it is evicting that block and, according to its age hints, the peer has an older block, it piggybacks the age of its oldest remaining block. When the peer receives the forwarding message, it replies with the age of its oldest block, taking into account the age of the block that it just received. As Sarkar and Hartman noted in , under this exchange protocol, a node that is evicting blocks frequently will be updating its hints often while a node that doesn't have much activity will tend to be a target for remote evictions and therefore it will be receiving up-to-date age hints as well. If a node does not have age hints for all peers, it chooses one of the peer for which it does not have an age hint and forwards the evicted master block there. In this way, eventually, each node will gather age hints for all peers.
Orthogonal to the issue of cache management is the issue of file distribution and location. Currently, we assume the general case of files being distributed across all nodes, with each node having a copy of the global file-to-node mapping. The actual file distribution and location scheme can be implemented in a variety of different ways, depending on where CCM is employed. For example, in Archipelago, Ji and Felten use a hash to map pathnames to cluster nodes  .
We compare CCM's performance to L2S, a highly optimized content-aware server that uses locality- and load-conscious request distribution to provide good performance in a wide range of scenarios . In particular, L2S tries to migrate all requests for a particular file to a single node so that only one copy of each file is kept in cluster memory. If a node becomes overloaded, however, L2S will replicate a subset of the files, sacrificing memory efficiency for load balancing.
L2S uses whole files as the caching granularity, employing a custom de-replication algorithm instead of block replacement. This algorithm behaves like local LRU, but incorporates load statistics and tries to keep at least one copy of each file in memory whenever possible.
In addition to differences in the memory management strategy, L2S also differs from CCM ``implementation-wise'' in two respects. First, L2S uses TCP hand-off to migrate client requests. Second, all simulations of L2S currently assume that each node contains the entire web content on its disk (although this assumption is not inherent to L2S). We discuss the potential implications of these differences in Section 5.
Our simulator derives from the one used to study L2S in . It is event driven and models hardware components as service centers with finite queues. Using this framework, we model a distributed web server running on 4-8 cluster nodes connected by a high-performance LAN. Clients are connected via a router with round-robin DNS capabilities . Figure 2 shows this setup. Note that, currently, we assume the same network is used to field/service client requests and for intra-cluster communication. This assumption does not affect our results significantly as the network is never the performance bottleneck.
At a more detailed level, each node is comprised of a CPU, memory, a NIC, and a disk, all connected by two buses, a system bus and an I/O bus. Each of these device is modeled as a service center with one or more finite queues. Each queue typically represents a particular subset of the possible events that use that resource. For example, arriving client requests are placed in one queue, peer block requests are placed in another queue, etc. These queues really model the capability of each resource to multi-task and are necessary to avoid simulation deadlocks. We model contention for the I/O bus explicitly but the effects of contention for the memory bus and all other costs of memory accesses are implicitly included in the CPU overheads for various processing events.
The modeling constants for all the major simulation components are shown in Table 1. The block-based operations are specific to CCM. The parsing and serving times represent the time to parse URL requests and the time to actually send content cached in local memory in reply to a request. Overall, our simulation parameters approximate a Cisco 7600 router , a VIA Gb/s LAN , an 800 MHz Pentium III CPU with 133 MHz system bus, and an IBM Deskstar 75GXP disk ; we derived these parameters (with help from Bianchini and Carrera ) using careful single-node measurements and some extrapolation. In order to model file system effects, we charge an extra seek for accessing the metadata on every 64KB access; we also assume that the file system provides a pre-allocation mechanism that ensures that files will be contiguous within 64KB blocks. The values presented in Table 1 for the modeled disk are probably on the conservative side. We chose these parameters as they are comparable to those used for L2S (and LARD ), making it easier to compare simulation results.
We use four web traces obtained from the University of Calgary, Clarknet (a commercial Internet provider), NASA's Kennedy Space Center, and Rutgers University to drive our simulator. Table 2 gives relevant details on these traces. The Calgary, Clarknet, and NASA traces were studied in detail in .
We use these four traces because they have relatively large working set sizes compared to other publicly available traces. For example, Figure 3 shows the cumulative distribution frequency and size for the Rutgers trace. Observe that in order to cache 99% of the requests, 494MB of memory is needed.
Even though we chose the biggest traces available, their working sets are still small. Thus, we found it necessary to simulate small memories (4-512 MB per node) to reproduce situations in which the working set size is larger than the aggregated memory of the cluster.
To measure the maximum achievable throughput of the cluster, we ignore the timing information present in the traces. Each HTTP client generates a new request as soon as the previous one has been served. We also measure throughput only after the caches have been warmed up in order to reflect their steady-state performance. Results
We have simulated CCM and L2S on clusters of 4 and 8 nodes. Figure 4 shows the throughput achieved by L2S and three variants of CCM when running on 8 nodes with varying amounts of memory; results for 4 nodes show the same trends and so are not shown here because of space constraints2. The algorithm described in Section 3 is labeled as CCM-Basic; the other variants are discussed below. In our discussion of these results, we will concentrate on the cases where the disk is still part of the performance bottleneck; that is, we will derive most of our observations from the portions of the throughput curves where performance is not yet saturated due to CPU overheads.
From the above results, we observe that CCM-Basic's performance lags that of L2S significantly. In many cases, CCM-Basic only achieves about 20% of L2S's throughput. When we looked closely at the gathered statistics, the following reasons for CCM-Basic's poor performance became apparent:
We believe that the first problem arises from an inaccuracy in our simulator: a reasonable system would likely implement some form of request scheduling, caching, and/or prefetching and so the seek penalty would not be nearly as large. To correct this inaccuracy, we implemented a simple scheduling algorithm in our queue of disk requests; whenever a block is read from a 64KB chunk of a file, if there are requests for any other blocks within that chunk, they are served at the same time in order to avoid additional seeks. This leads to the CCM-DS performance curves, which are better than CCM-basic but still significantly worse than L2S.
This led us to examine the second problem, postulating that in a server environment, where network performance is increasing rapidly relative to disk performance, stronger preference for keeping master blocks in memory at the expense of traversing the network more often for blocks from hot files would lead to increased performance. There were many potential ways to modify the replacement algorithm. We chose a simple adaptation to explore the correctness of this theory: when eviction is necessary, never evict a master copy if the evicting node is still holding a non-master copy; instead, evict the oldest non-master copy. If the node is only holding master copies, then perform the global LRU eviction as before. This modification leads to the CCM performance curves.
Figure 5 shows the effect that this modification to the replacement algorithm has on the local vs. global block hit rates. Observe that, indeed, CCM-Basic has higher local hit rates than CCM but that the global hit rate for CCM is higher than that of CCM-Basic. Further, the in-memory hit rate (local + global) is significantly higher for CCM, which is as expected since CCM was modified specifically to increase the size of the data set cached in memory.
Note that CCM's replacement algorithm is rather extreme; it leads to all memories holding only master copies, which does not necessary lead to best performance since nodes have to fetch data remotely to serve most requests (see Figure 5(b)), paying the attendant communication and processing costs. An algorithm that allows some discard of master blocks before all non-master blocks are gone may lead to better performance.
Despite CCM's limitation, however, it is a good starting point to evaluate whether cooperative caching can be more competitive with locality- and load-conscious request distribution. Clearly, it can! CCM is quite competitive with L2S, achieving over 80% of L2S's throughput in almost all cases, and achieving over 92% of L2S's throughput in most cases3. To show these differences more clearly, Figure 6 normalizes CCM's throughput against that of L2S. Except for systems with very small memory sizes (4MB per node and sometimes 8MB per node), CCM is very competitive with L2S. Moreover, L2S currently has two performance advantages over CCM: (1) L2S uses TCP hand-off, and (2) current simulations of L2S assume that files are replicated on all disks in the cluster. Bianchini and Carrera have shown that a server that doesn't make use of TCP hand-off can loose up to 7% throughput in one particular cluster environment . Figure 7 shows an estimate of the performance advantage of full data replication on disks for the Rutgers trace when running on 8 nodes. In this figure, we plot normalized throughput for two additional variants of CCM, an optimistic version (CCM-O) where hints are always accurate at the time when they are used (although actions based on these hints may turn out to be incorrect because of changes in the system from when the hints were consulted to when a resulting action is completed), and the optimistic version with full replication (CCM-OF). CCM-OF gives an additional 3-4% performance improvement over CCM-O (over all traces). It is not clear that CCM can leverage full file replication on disk to improve performance since not going through the home to read master blocks from disk may lead to multiple master copies for a single block in memory--we are currently investigating how to leverage such replication in CCM. Further, the 7% performance advantage of TCP hand-off clearly does not translate literally to our experiments. These differences argue, however, that we may be able to further improve CCM performance so that there is little or no performance loss from that of L2S.
Figure 8 shows that CCM achieves much of L2S's performance because it makes efficient use of main memory, providing close to L2S's byte hit rates, albeit most are global hits. (As already noted, in some cases, CCM achieves better byte hit rates than L2S.) Further, CCM's hit rates come close to the theoretical maximum possible; for example, CCM's hit rate for the Rutgers trace is 96% with 64MB per node compared to the theoretical maximum of 99% for 512MB of total memory (see Figure 3).
Figure 9 shows the effectiveness of the location hint system, plotting the number of hops a request must travel on average before the master copy is found (or before the request is rerouted to the home to be serviced from disk). This figure only include requests that miss in the local cache of a node and so the minimum number of hops is 1. This data explains why CCM can perform poorly when system memory is severely limited; hints become stale rapidly, causing requests to inaccurately ``chase'' the caching of blocks through the system. Further, age hints also become stale rapidly, leading to incorrect eviction of master blocks. On extreme cases the error rate reaches 20% of incorrect evictions. blocks. Figure 10 shows the loss of throughput due to the overhead of maintaining hints and hint inaccuracies by comparing CCM's performance against that of the optimistic version of CCM (CCM-O). This Figure clearly shows that, in the case where CCM performs badly compared to L2S, it also performs badly compared to CCM-O and so the performance loss can be attributed to failures of the hint system.
Surprisingly, CCM's complete lack of load balancing does not hurt its performance compared to L2S. This is because the round-robin distribution of requests diffuse the hot files throughout the cluster. Thus, no single node is overwhelmed by peers' requests for copies of hot blocks.
Of course, this round-robin request distribution coupled with CCM's complete favoring of master blocks may lead to increased response time (because of the additional communication needed to follow hints and to obtain copies of needed blocks from peers). Figure 11 plots CCM's average request response times, normalized to those of L2S, against varying memory sizes. These results show that there is some degradation of response time--in most cases, less than 20%. At very small memory sizes (e.g., 4MB per node), CCM's response time can degrade more significantly (20-70%) because of the inaccuracy of the hints. In one case, the Clarknet trace, CCM achieves better response time than L2S when there is more than 8MB of memory per node because of its greater byte hit rates. At least a portion of these differences is likely due to artifacts of our simulation system; we had to model CCM at a more detailed level in the simulator (because of the many more possible events in a block-based system), creating greater potential for inter-request interference. Thus, we do not believe that the difference in response time is significant, except in the cases of very small memory sizes.
To examine CCM's scalability vs. cluster size, we simulated the Rutgers trace on systems with up to 32 nodes, with 32MB of memory on each node. We only look at the Rutgers trace because the other traces' working sets are too small for such a scalability study. Figure 12(a) plots the resulting throughput, showing that CCM scales quite well up to 32 nodes.
Finally, we wanted to examine the effects of CCM's trade-off of increased communication for greater in-memory hit rates. Figure 12(b) plots the average disk, CPU, and network utilization of a cluster of 8 nodes running the Rutgers trace. Clearly, given the relative performance of Gb/s LANs and disks, this was the right trade-off. We were curious to see what would happen if we considered the use of only a 100Mb/s network. Figure 13 shows the resulting throughput and resource utilization for the Rutgers trace running on 8 nodes. Even with this slower network, communication is not the performance bottleneck (although it is a close second to the CPU at larger memory sizes).
In summary, we have shown that a server employing cooperative caching has the potential to achieve much of the performance of locality- and load-conscious servers. Given the parameters of current clusters (e.g., Gb/s LANs), it is important to modify the cooperative caching algorithm from the ones proposed previously for client-side cooperative caching. In particular, our results point to the need for strongly avoiding the eviction of master copies, which leads to disk accesses. Our modified algorithm, CCM, achieves similar hit rates to a server implementing locality- and load-conscious request distribution, L2S, by ensuring that memory is first used to hold the working set of master copies before replicas are made. Of course, this trades increased network communication for higher in-memory hit rates. The low CPU overhead for network communication that comes with modern high-performance LANs is probably critical to the success of this trade-off.
Conclusion We have studied the performance of cooperative caching in the context of scalable Internet servers; specifically, we have compared the performance of a web server built on top of a generic block-based cooperative caching layer to that of a server employing sophisticated locality- and load-conscious request distribution. Our results show that, when combined with an off-the-shelf web server and round-robin DNS, cooperative caching has the potential to achieve much of the performance of locality-conscious servers. This trade-off of a small (if any) amount of performance can be extremely beneficial as it means that designers of Internet services can use a generic cooperative caching layer as a building block instead of re-implementing service-specific request distribution and cache coherence algorithms. Beyond this paper, we plan to investigate how to support writes as well as reads in CCM. We will also investigate how to parameterize CCM so that it can be adapted to particular applications. For example, we will investigate whether CCM can easily be adapted for servers that always use whole files (e.g., a web server) and whether such an adaptation would improve performance. Finally, it would be interesting to consider how cooperative caching might be used together with content-aware request distribution in generic middleware layers.
This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)
Copyright © 1993, 1994, 1995, 1996,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -nonavigation -math -html_version 4.0 -split 0 paper.tex
The translation was initiated by Francisco Cuenca-Acuna on 2003-04-25