P2P Network Proposal
From OpenAerialMap
Contents |
Future directions: P2P extensions?
One of the main bottlenecks in the previous development of OAM has been the issue of scalability and reliability. A storage network composed of volunteered nodes in a hybrid peer-to-peer configuration will hopefully address some of these issues. Client access to the storage network will be through the usual web map tile service APIs, although source imagery can be stored in the network and retrieved from the network as well.
This document is intended as a supplement to the draft Technical Proposal, and a recommendation for future development of the project past the initial stages described there.
Data Integrity
Public key infrastructure
When a user is created, a crypto key pair is generated on the server side. The server encrypts the private key with the user's password, and never saves the password text, only the hash.
The server publishes a list of users and their public keys. Any user can download their private key by entering their password into the site.
(A possibly more sensible idea would be to ship a configuration utility with each storage node that allows a user to generate an unencrypted private key on their local end, autoconfigure the node to use it, and then one-time prompt for the user's email and password in order to be able to register the public key with the metadata server.)
The server publishes its own signed public key and a server-key-signed list of revoked public keys. Administrators can also revoke user keys on the server side (in case a user key is compromised or misused). Users should be able to request revocation of their keys.
Source imagery
When source imagery is uploaded to the storage network, the source image URI and metadata should be registered with the metadata server, so that it can return a list for any given layer on request. A hash of the image file signed by the uploader should be registered as well.
Tiles
Tiles are stored in the storage network and are immutable. Once a tile is created and associated with a layer, it cannot be removed from the network, except by a metadata server signed key revocation. (One exception to this rule, in order to ease the introduction and performance of signing tiles, might be to allow the expiration of unsigned tiles in favor of a signed tile for the same layer and grid cell.)
Tiles must have an iTXt chunk containing the tile metadata. Tile metadata includes the layer ID, layer depiction date, and the X/Y/Z coordinates of the tile in the tile grid.
Tiles must also have an iTXt chunk containing a signed hash of all of the other data chunks and the fingerprint of the key used to sign the image.
Storage Network
Each leaf node will be both a storage peer and an HTTP tile (WMS-C, TMS, KML SuperOverlay) server. The HTTP server will need to be able to respond to all HTTP/1.1 requests on port 80 addressed to tile1.oam.org, tile2, tile3 and tile4. Alternately we should pick an unused high TCP port for the HTTP server to sit on and use that, meaning that we can offer standalone node servers without requiring root access to run them.
P2P exchanges between nodes can probably just use HTTP for transport.
Configuration
Storage nodes will minimally need to be configured with local storage allocation (in GB). A sensible default (e.g. 10 GB) can be shipped with the out-of-the-box configuration.
A storage node should be also configured with the coordinates of its geographic location. If unconfigured, the metadata server will assign a plausible default, as described below.
Storage nodes should also be configured with a bandwidth throughput weighting, which it should report to the server on startup. If left unconfigured, the out-of-the-box configuration should use a fairly low default (e.g. <1Mbps).
A storage node may or may not be a trusted node. In order to set up a trusted storage node, a user has to register with the server and store their private key in their node configuration. Nodes that are not trusted may join the network, cache tiles, and deliver them, but will not be regarded as available for the purposes of redundant storage in the network. Nodes run out-of-the-box will be untrusted.
Operation
At start up, the storage node should request the following lists from the metadata server:
- All nodes in the network
- All layers validated on the metadata server
- All signed layer revocations
The node should refresh these lists periodically. The server should provide the lists in a differential format by returning only adds and removes when a list update is requested to keep bandwidth usage down. This request can also serve as a heartbeat to let the metadata server know that a node is still up. The node can also cache the list locally in order to be able to make differential requests across restarts.
When a storage node pings the network server, it should provide its configuration details as part of the request. If the node does not know its geographic location, it should query the metadata server and be assigned coordinates based on its physical IP address location.
A node is considered "available" for redundant storage if it has remaining storage space and is trusted.
A node regards itself as "local" for every grid cell in the pyramid to whose centroid the node is one of the k-nearest *available* nodes geographically. Additionally, if a grid cell contains the node's location, that cell is always local to the node. All non-full nodes must cache tiles that are local to them. Nodes should cache non-local tiles in whatever storage allocation remains on the node.
When a tile is uploaded to a storage node, it should check the signature on the tile. If the key fingerprint is unknown to the node, it should query the metadata server for the public key matching the fingerprint, and cache the result. If the signing key is unknown to the metadata server or has been revoked, the node should reject the tile. If the tile signature is good, the node should propagate the tile to the storage network. If the tile is local to the node, it must be stored there as well. A node should never overwrite a tile it already has stored, except when a signed tile replaces an otherwise equivalent unsigned tile.
If a storage node runs out of space to store local tiles, it can declare itself full and refuse to store a tile or a raw image. Local tiles or images should nevertheless always be propagated to the k-nearest available nodes when a tile is uploaded. Non-local tiles can be expired on some kind of LRU-type basis.
Nodes may throttle requests to their bandwidth limit if it is configured.
Handling cache misses
If a node does not have a requested tile in its cache, it must identify the k-local nodes for that tile and proxy the request for that tile to each of those nodes. The first request to return a valid result is used; the others are ignored.
When a node downloads a tile from the storage network, it *should* check the key before caching it or returning it in response to a request. Hopefully the key check will be fast and the number of public keys to be cached small. To ensure speedy responses to clients, the node might return the tile immediately, then run the signature check, and, if it fails, refuse to cache the tile. As another optimization, a node might check a small sample of tiles returned from a given peer, and ignore that peer in the network for a significant timeout if any of that peer's tiles fail to pass muster. Nodes may elect to signature check new tiles that are local to them before caching.
When a node receives a request for a tile that is local to that node *but* the node does not have the tile in its cache, *and* the requested layer is from a WMS (etc.) source, it should request that tile from the network.
If no other storage node returns the tile immediately (e.g. <100ms), and the requested grid cell is not at a larger scale than the layer itself, the node should request the tile from the source WMS (etc.) service, then sign the tile and upload it to the storage network. (This implies that source file storage nodes should be equipped with on-the-fly reprojection, whether it's gdalwarp, MapScript, or something. Too bad that MapScript is memory leaky, at least in Python.)
If the node doesn't have the tile *and* the requested layer sources are stored in the network, the node should look up the location of the network source file and request the corresponding tile from the node(s) that serve that file. If the node *has* the source file but not the tile, it should be prepared to extract the tile from the source file, reprojecting if necessary, sign the tile, return it, and then push the tile into the network.
If a node receives a web service request for such a tile, it may return the tile data first before signing and uploading.
A node may always return a "tile not found" message if the storage network does not return the tile within a reasonable timeout, e.g. 2-5 seconds, or if the request is outside the layer's bounding box. For web service requests, a corresponding redirect to a static "imagery not available at the present time" image hosted at www.openaerialmap.org should be returned, but these images should never be cached in the network.
Clients should be able to request tiles via web service APIs for the virtual "global" layer or any single tile layer by its metadata server ID. Storage nodes should provide capabilities metadata for all layers in the network, at least until the number of layers grows very large. (Then what?)
Care should be taken in the network design to avoid feedback loops, particularly when nodes are full.
Maintaining storage redundancy
Every n seconds, each node should randomly choose a cached tile and push it to the k-nearest available nodes local to that tile. The timeout for this update should be based on some heuristic determined by the node's configured bandwidth capacity.
Again, care should be taken to avoid feedback loops on storage updates.
Uploading tiles to the network
A user may upload tiles with an upload utility. The user must download their public key from the metadata server in order to upload tiles. The upload utility will warp the source images on the client side, extract tile images, add metadata, and sign each tile, before uploading them to the storage network.
Uploading source imagery to the network
Users should be able to upload untiled source imagery to the storage network, which are keyed by a structured URI describing their layer ID and bounding box. To avoid overloading any one storage node, source images should be distributed randomly by a hash of the URI, rather than geographically, across the storage network.
Likewise, some kind of file format and file size limitations should be placed on source images for sanity. Source imagery can be in any projection, so long as the source images have the projection encoded in the image file metadata. The upload utility should be prepared to chunk up the file along smaller scale tile boundaries in the source format and projection small enough to upload this way.
When a source image is uploaded to the network, the corresponding image tiles should but do not need to be uploaded also, as in principle the network can extract tiles from source files as needed,. If a user has the bandwidth and CPU to upload tiles *as well as* source images, this would be ideal.
As a speed optimization, the storage network might convert the source image in its original projection to an internally tiled, lossless format, such a tiled LZW-compressed GeoTIFF, so that generating images tiles on the fly happens faster.
Information about each source image uploaded to the network should be registered with the metadata server, and linked to the corresponding layer, as described above.
Performance and Reliability
The original version of this proposal suggested the use of a variant on Distributed Hash Tables (DHT) as the discovery layer for the network. An accomplished network engineer has pointed out that this approach may be a mistake to pursue, due to the complexity of DHTs, the relatively high latency involved in DHT lookups, and the relative paucity of P2P networks that successfully use DHTs. Accordingly, the DHT component of this proposed has been dropped in the latest version.
The geographic nature of the caching lookup will be integral to the storage network's performance. In essence, the algorithm proposed computes a sort of weighted Voronoi diagram on the surface of the Earth at each zoom level, where the storage nodes are the internal points of each face. Because all grid cells that contain a node are considered local to that node, small scale (and presumably more frequently requested) tiles will tend to get stored at larger numbers of nodes, while larger scale tiles will be stored at smaller numbers of nodes.
The eventual size of the node and layer lists is a concern, especially when both climb above 10,000 items. Fortunately, this outcome is a long way off, and differential updates combined with HTTP compression will also ease the burden. However, differential updates also raise the question of what becomes of the network when nodes get out of sync.
An even wackier idea might be to throw away the node list updates, and instead implement a DNS server that divides the Earth into (say for example) 16x16 = 256 storage zones. This "GeoDNS" server could then just return a capacity-weighted, round-robin subset of the k-local nodes for each zone, with a low enough TTL that nodes and clients get the benefit of caching, but get reasonably fast updates when nodes go offline. Furthermore, a smart client (e.g. OpenLayers, TileCache, GeoWebCache) could be optimized to request tiles directly from their storage zones, rather than using an arbitrary proxy node or merely using one that's local to the client.
Another optimization that might be considered is the appointment, where possible, of "super-peers" hosted at institutions with large storage and bandwidth capacity. These nodes could be registered with the network at *multiple* geographic locations, thus easing the potential load on smaller regional servers. Smaller servers could also opt out of hosting source imagery in order to optimize space for serving actual tiles.
Additional Reading
- Distributed Tile Caching page on the OSGeo wiki from November 2006
- A distributed tile caching model from the OSGeo wiki, also from November 2006.
