EP2145362A2 - 346infrastructure informatique - Google Patents

346infrastructure informatique

Info

Publication number
EP2145362A2
EP2145362A2 EP08746717A EP08746717A EP2145362A2 EP 2145362 A2 EP2145362 A2 EP 2145362A2 EP 08746717 A EP08746717 A EP 08746717A EP 08746717 A EP08746717 A EP 08746717A EP 2145362 A2 EP2145362 A2 EP 2145362A2
Authority
EP
European Patent Office
Prior art keywords
shadows
nodes
scram
node
master
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP08746717A
Other languages
German (de)
English (en)
Other versions
EP2145362A4 (fr
Inventor
David Duchesneau
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SCRUTINY Inc
Original Assignee
SCRUTINY Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SCRUTINY Inc filed Critical SCRUTINY Inc
Publication of EP2145362A2 publication Critical patent/EP2145362A2/fr
Publication of EP2145362A4 publication Critical patent/EP2145362A4/fr
Withdrawn legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/161Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/10Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/105Multiple levels of security
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S40/00Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
    • Y04S40/20Information technology specific aspects, e.g. CAD, simulation, modelling, system security

Definitions

  • Datacenters represent a single point of failure (the datacenter itself), regardless of the level of internal redundancy Thus, a single regional disaster or terrorist attack may effectively destroy companies whose livelihood depends on a failed (or destroyed) datacenter Of course, a datacenter may fail without actually being attacked
  • the system establishes a supercomputing platform that meets the needs of the broad but unaddressed market for high-performance computing - both in terms of total throughput, but also in terms of low-latency (although not every problem or customer necessarily requires low latency) - while achieving unprecedented levels of affordabilitv (both capital and operational expense)
  • affordabilitv both capital and operational expense
  • the system was intentionally designed to operate in a "lights-out,” “hands-off”, “unattended,” “maintenance-free” environment
  • the system is designed to operate unattended, in low-cost remote locations - for years
  • the system incorporates design concepts that address next-generation needs, so that supercomputer manufacturing and production can occur on a scale appropriate for a radically enlarged addressable market, resulting in the lowest possible cost structure, while maximizing production flexibility
  • This sort of quantum reduction in cost structure is preferred, if only to achieve the levels of affordability necessary to jumpstart the otherwise unaddressable broadest markets
  • SHADOWS as a Distributed, Decentralized Centralized Architecture
  • SUREFIRE Sites as Survivable Mini-Datacenters
  • How Distributed Machines Are Organized at Multiple Sites 27
  • SERVANT Service Executor, Repository, & Voluntary Agent - Non-Trusted
  • MARSHAL Multi-Agent Routing, Synchronization, Handling, & Aggregation Layer
  • DELEGATE Distributed Execution via Local Emulation GATEway
  • SCRAM - Survivable Computing, Routing, & Associative Memory 31
  • SCRAM Subsystem 34 SCRAM Processing Node 34 SELF - Secure Emergent Learning of Friends 34
  • ADSL Asymmetric DSL.
  • AFR Annualized Failure Rate. The percentage of disk drives in a population that fail in a test scaled to a per-year estimation
  • the AFR of a new product is typically estimated based on accelerated life and stress tests, or based on field data from earlier products (commonly on the assumption that the drives are 100% powered on)
  • AFR estimates are typically included in vendor datasheets (e g , 0 88%, 0 73%, 0 63%, etc , for high quality disks with MTTFs in the range of 1 million hours to 1 4 million hours)
  • vendor datasheets e g , 0 88%, 0 73%, 0 63%, etc , for high quality disks with MTTFs in the range of 1 million hours to 1 4 million hours
  • the datasheet AFR and the field AFR differ widely According to a 2006 study of large-scale supercomputer clusters and ISPs that analyzed drives over a 5-year period, the field AFR exceeds 1 %, with 2% to 4% common, and up to
  • API Application Programming Interface
  • ASIC Application-Specific Integrated Circuit.
  • a hardware device containing fixed (not reconfigurable) logic and other circuitry See also FGPA
  • Availability ability of a component or service to perform its required function at a stated instant or over a stated period of time It is usually expressed as an availability ratio or percentage, i e , the proportion of time that a system can be used for productive work Availability over a particular period of time is calculated as ((Total_T ⁇ me - Unava ⁇ lable_T ⁇ me) I Total_T ⁇ me), where the Unava ⁇ lable_T ⁇ me is closely related to the MTTR Availability goals are often expressed as 99 9% ("three nines" availability), or 99 99% (“four nines” availability), and so on "Three nines" (99 9%) implies a maximum downtime of 8 hours and 46 minutes per year, whereas "four nines" (99 99%) limits the downtime to 53 minutes per year * Also: In the context of security, availability refers to the property of a system or a system resource that ensures it is accessible and usable upon demand by an authorized system user Availability is one of the core characteristics of a secure system See also AFR,
  • BB-RAM Battery-Backed RAM.
  • Alternative NVRAM implementations e g , MRAM
  • the SHADOWS CHARM technology uses diverse NVRAM technologies, including BB-RAM However, CHARM's primary rationale for using BB-RAM (in conjunction with the FIRE and NEAR technologies) is to be able to better control its secure erasure in case of intrusion, in addition to its traditional NVRAM use See also CHARM, FIRE, MRAM, NEAR, NVRAM, RAM
  • Bloom Filter A probabilistic algorithm to quickly test membership in a large set using multiple hash functions into a single array of bits Bloom filters are a space-efficient set membership structure whose key characteristic is to never yield false negatives, although it may yield false positives Bloom's algorithm lets one determine whether a value has been previously handled (e g , processed, stored, etc ), although one must allow the possibility of a false positive when extracting this information
  • the N-bit array starts out reset (FALSE), before we start to populate it with data As the bit array becomes more populated, sometimes we may set a bit that has already been set This is the root of the false positive cases we examine
  • phase of a Bloom filter is defined to be the number of times we attempt to set a bit in the array (5 to 20 times in this example) Both of these variables can be modified to change the accuracy and capacity of the Bloom filter Generally speaking, the larger the size of the bit array (N) and the higher the phase, the smaller the probability of false positive responses occurring Statistically, the chance of a false positive can be determined by taking the saturation (which decreases with increasing filter size) and raising it to the power of the phase
  • BOSSTM Byzantine Object & Subject Security.
  • TBC trusted computing base
  • object/subject security system that incorporates Byzantine agreement logic (from the classic "Byzantine generals" problem) in its decision-making process, and collectively makes security decisions in a "fail-silent" manner that provides survivability even in the face of multiple failures and/or corrupted nodes
  • BOSS is implemented and instantiated only in conjunction with a MASTER, and works in conjunction with CHARM to control who gets access to what, and when, while ensuring that unauthorized information is not exposed (not even to other internal systems)
  • BOSS is designed to enable the SHADOWS infrastructure to support both classified and unclassified information processing and storage (e g , to meet or exceed Common Criteria (CC) Protection Profiles (PP) such as the U S DoD Remote Access Protection Profile for High Assurance Environments Version 1 0 June 2000 nominally at EAL5 or potentially at EAL6 if implemented by a single qualified development organization) Any BOSS node that fails
  • Cache A small fast memory (relative to a larger, slower memory) holding recently accessed data, designed to speed up subsequent access to the same data Most often applied to processor-memory access but also used for a local copy of data accessible over a network, etc See also Cache Conflict, Direct Mapped Cache, Fully Associative Cache, Set Associative Cache
  • Cache Line (Or cache block) The smallest unit of memory than can be transferred between the main memory and the cache Rather than reading from a larger but slower memory one word or byte at a time, each cache entry is usually holds a certain number of words, known as a "cache line” or "cache block,” and a whole line is read and cached at once This takes advantage of the principle of locality of reference - if one location is read then nearby locations (particularly following locations) are likely to be read soon afterwards See also Cache
  • Caching is a useful general technique that sometimes makes programs run faster It does this by exchanging space for time Caching tries to save previously accessed or computed results in an attempt to reuse them later rather than recomputing them
  • Caching is useful in all kinds of situations, including, almost any kind of searching (cache the results of the search so that you can skip it next time), HTML generation (cache the results of the generation process so that you don't have to generate the page next time), and numeric computation (cache the results of the computation) See also Cache, FACTUAL, Memoization
  • CHARMTM Compressed Hierarchical, Associative, & Relational Memory.
  • An associative memory system that provides high-capacity, highly available and survivable persistent storage, and secure, rapid recall of information Incorporates local FIREblades and NEARblades (or FIREd ⁇ ves and NEARd ⁇ ves), and collaborates with CHARM systems of other local and remote SCRAM nodes Unlike relational databases, all information is indexed, and, from a hardware perspective, all index storage is electronic (no latency due to spinning media), but without the expense associated with general-purpose SSD Unlike relational databases, records or objects not meeting security constraints are never even accessed (e g , in order to check security attributes) Hard disks store only fractional archival data All geographically colocated equipment and data are expendable without information loss or disclosure See also FACTUAL, FIRE, NEAR, SCRAM, SSD
  • CNC Computer Numerical Control.
  • NC simply Numerical Control
  • Codec A complementary pair of functions comprising an encoder and a decoder, such that an input can be provided to the encoder, the output of which is fed to the decoder, the output of which is the original input See also FEC
  • CORETM Communications, Optimization, & Reasoning Engines.
  • engines whose purpose is to encapsulate and securely execute high-performance and/or hardware-assisted general purpose implementations of critical compute-intensive processes, to minimize latency, maximize overall throughput, and reduce operational costs
  • engines e g , FACTUAL, FASTpage
  • FACTUAL, FASTpage are typically closely associated with other systems, and are thus described in those contexts See also FACTUAL, FASTpage
  • CPU Central Processing Unit
  • CRC Cyclic Redundancy Check
  • Critical Heat Flux The heat flux beyond which boiling cannot be sustained because the liquid working fluid no longer wets the surface of the heat source Since heat flux is typically given in of W/cm 2 , those skilled in the art may recognize that the critical heat flux can be extended by numerous means, including any that increase the effective surface area of the heat source (the denominator) for a given level of power (the numerator), but also including the use of increased turbulence, special coatings, etc See also Heat Flux, RUBE
  • DDoS Distributed DoS.
  • a DoS involving distributed attack sources such as when multiple compromised systems cooperate to flood the bandwidth or resources of a targeted system, usually one or more web servers
  • the systems involved in a DDoS attack become compromised by the DDoS perpetrators via a wide variety of methods, and, depending on the nature and extent of the compromise, may contain relatively static hardcoded attack vectors (e g , "MyDoom," which involved a hardcoding the IP address in advance of an attack), or may contain sophisticated control mechanisms such that the compromised systems collectively form one or more "botnets " Unlike hardcoded attacks, botnets can be controlled dynamically, and thus targeted at any IP address at any time
  • the DDoS strategy provides an attack perpetrator with numerous advantages 1) orders of magnitude more attack traffic than a simple DoS, 2) increased stealth and detection avoidance, and 3) significant defense challenges for the targeted victims, since the aggregate bandwidth of a large botnet can easily any practical amount of surplus bandwidth a defender might purchase to mitigate DDoS attacks (and perpetrators can
  • DELEGATETM Distributed Execution via Local Emulation GATEway.
  • a distinguished SERVANT node having the responsibilities of fulfilling a DELEGATE role
  • the DELEGATE role implements secure client-side "proxy" agent that appears to locally implement a particular service which would normally be implemented elsewhere, such as on a local or remote server, but instead is actually implemented within the SHADOWS network cloud
  • the DELEGATE concept is described further in section 5 2 See also API, DBMS, DNS, LDAP, MPI, P0P3, RUSH, RUSHrouter, SIP, SLA, SMTP, VoIP
  • delta Compression attempts to recognize the differences between successive or nearly successive versions of a file, in order to express one in terms of the other plus some differences (the "delta") This works well in practice because the reference object ( ⁇ e , the earlier file version) is already known, and the relationship between the files is already established - the files are expected to be similar, so processing is straightforward and efficient
  • the compression ratio is excellent, such that storing two nearly identical files with delta compression would consume little more space than storing one of them (the ideal is reached when the files are identical)
  • applying the technique much more widely - such that an arbitrary data sequence can be efficiently expressed in terms of some other arbitrary data sequence whose identity is unknown but must be efficiently discovered during the compression process - is a
  • Direct Mapped Cache A cache where the cache location for a given address is determined from the middle address bits If the cache line size is 2" then the bottom n address bits correspond to an offset within a cache entry If the cache can hold 2 m entries then the next m address bits give the cache location The remaining top address bits are stored as a "tag" along with the entry In this scheme, there is no choice of which block to flush on a cache miss since there is only one place for any block to go This simple scheme has the disadvantage that if the program alternately accesses different addresses which map to the same cache location then it may suffer a cache miss on every access to these locations This kind of cache conflict is quite likely on a multi-processor See also Cache, Fully Associative Cache, Set Associative Cache
  • DMZ De-Militarized Zone
  • this is a buffer zone situated between a protected LAN and a WAN, and occupied by bastion servers, firewalls, or other devices that are sufficiently hardened so as to safely withstand direct exposure to the Internet
  • DoS attack In computer security, a denial-of-service attack (DoS attack) is an attempt to make a computer resource unavailable to its intended users
  • a DoS attack is characterized by an explicit attempt by attackers to prevent legitimate users of a service from using that service, typically via one of the methods 1) consumption of computational resources, such as bandwidth, disk space, or CPU time, 2) disruption of configuration information, such as routing information, 3) disruption of physical network components Attacks can be directed at any network device, including attacks on network routers and servers (e g , web servers, email servers, DNS servers, etc )
  • Examples of DoS attacks include 1 ) flooding a network, thereby preventing legitimate network traffic, 2) disrupting a server by sending more requests than it can possibly handle, thereby preventing access to a service, 3) preventing a particular individual from accessing a service, and 4) disrupting service to a specific system or person See also DDoS DRAM.
  • Dynamic RAM Volatile RAM that is characterized by its need to be continually refreshed, in order to prevent data loss, and by it relatively high storage density, relatively high performance, and relatively low cost
  • DRAM dynamic RAM
  • the dynamic nature of DRAM, in conjunction with its rapidly increasing storage density, creates a situation where there's a significant probability of multiple SEUs coinciding in the same access, resulting in data loss that is detected by cannot be corrected, or worse yet, undetectable data loss See also CHARM, MRAM, NVRAM, RAM, SEU, SRAM
  • DSL Digital Subscriber Line.
  • EAL Evaluation Assurance Level.
  • a package consisting of assurance components from the Common Criteria (CC), Part 3 that represents a point on the Common Criteria predefined assurance scale (e g , EAL5 or EAL6)
  • Emulate Generally, to imitate exactly Specifically, the capacity of one computer system to imitate another, or to imitate the interfaces and environment of another, such that relative to a particular set of expectations there is no difference between the emulator and that which it emulates See also Simulate
  • Emulation The situation in which one computer behaves like another, or imitates another's interfaces and environment See also Simulation
  • FACTUALTM Frequency-Adaptive Computation Table & Use-Adaptive Lookup.
  • a process-oriented memoization ("memo table") capability that retrieves previously computed, "vetted” results for arbitrary deterministic processes and functions All values that can affect the output (including the identification of the exact process and any parameters) may be provided as input, along with a timeout value and a list intended recipients, and a signed and certified result can be sent to them
  • FACTUAL implements a race ("looking up” vs "recomputing" vetted results), but lookup typically starts before the request even reaches the head of the request queue for the target process In the event a process starts due to timeout, if there's a "hit” and the looked-up result becomes available in time, it may be used as an oracle instead, to check the process Misses cause no latency penalty
  • FACTUAL is a global, process-based capability that takes advantage of the persistent associative memory of the CHAR
  • FASTpageTM Fast Associative Search Tree, pageable.
  • a fast, highly scalable, associative memory mechanism that can adapt to the information to be remembered, in order to optimize both time and space FASTpage index size is limited only by the availability of system-wide resources
  • FASTpage is well-suited to both transient in-memory data (generally faster than hash-table searching) and persistent data (designed for extremely fast searches with indexes in flash memory)
  • the FASTpage search mechanism is based on a hybrid comprising a novel "pageable" Ternary Search Tree (TST) (having compressed, variable-length nodes and exhibiting locality of reference) and a novel "pageable" digital search Trie (having vectored, compressed, variable-length off-page references)
  • TST Ternary Search Tree
  • Trie novel "pageable" digital search Trie (having vectored, compressed, variable-length off-page references)
  • the FASTpage storage mechanism is optimal for flash-based storage, and intended for use in hierarchical memory systems
  • FEC Forward Error Correction
  • FEC Forward Error Correction
  • CHARM uses FEC for communications, and also for data distributed both locally and remotely on both transient media (e g , in DRAM) and persistent media (e g , flash or magnetic storage) Because CHARM's storage formats are already FEC-encoded, stored data can be transmitted without further FEC encoding, as appropriate
  • the FEC codecs are implemented as part of the CORE functions embedded in the PUMP devices
  • the general purpose processors in SHADOWS also implement the FEC codecs See also CHARM, Codec, CORE, DRAM, ECC, FIRE, NEAR, PUMP
  • FIRETM Fast Index & Repository Emulator.
  • the technology underlying a FIREbladeTM or FIREdriveTM, and used by CHARM as its primary online persistent storage FIRE combines quickly securable DRAM and BB-RAM for high-speed storage of rapidly changing data, with the DRAM used for caching already-stored (and therefore expendable) in-the-clear data " , for example, and the BB-RAM used for buffering committed transactional data (in conjunction with a set of other suitably distributed instantiations of FIRE with which it communicates) not yet written to long-term, persistent data storage
  • FIRE uses flash-based storage (or its equivalent) rather than magnetic storage, and this provides high-performance all-electronic, long-term, persistent data storage that is immune to mechanical wear and vibration (including seismic events)
  • the flash-based storage also operates at very low power (typically less than 1 microwatt per lOPS, vs more than 20 to 40 milliwatts per IOPS for low-power and/or high-
  • FLAMERouterTM Firewall, Link-Aggregator/Multiplexer & Edge Router, (aka FLAMER)
  • a special MASTER that serves as a gateway and tunneling router between the LAN fabrics of a SHADOWS node and one or more wide-area networks (WANs) Automatically tunnels SHADOWS communications protocols (e g , RUSH, RECAP, UNCAP) over existing LAN and/or WAN protocols as necessary See also HSLS, RUSHrouter, LAN, Tunneling Router, WAN
  • FORCETM Frictionless Organic Rankine Cycle Engine.
  • a kilowatt-scale turboalternator (heat engine) consisting primarily of an efficient, low-temperature (130 0 C), low-pressure (6-8 bar) vapor turbine connected to an alternator
  • the FORCE turboalternator has only one moving part (the shaft), which spins at very high speed (e g , nominally at 62,000 RPM in a preferred embodiment) and rides on hydrodynamic "vapor bearings" - essentially a vapor layer created by its rotating foils, due to the Bernoulli effect
  • the vapor bearings are optimally advantageous to reduce friction losses to near zero during normal operation (there is still some residual friction due to colliding vapor molecules)
  • a novel optional embodiment of the FORCE turbine also engages separate pneumatic-like vapor bearings during spin-up and spin-down (but not during normal operation), and thus completely avoids any wear-inducing friction within the turbo machinery In the absence of separate spin-up/spin-down vapor bearings, the foils
  • FPGA Field-Programmable Gate Array.
  • a type of reconfigurable logic based in hardware, that may also include specialized, embedded devices to that provide enhanced functionality and/or performance while minimizing the use of reconfigurable logic gates Contrast with ASIC See also ASIC
  • FPSC Free Piston Stirling Cooler.
  • FPSE Free Piston Stirling Engine.
  • FRAMETM Forced Recuperation, Aggregation & Movement of Energy.
  • a power production and/or peak- shaving energy management capability whose goal is to reduce operational costs and enhance or enable survivability FRAME works by significantly reducing the energy required to operate a heat-dissipating system (such as a computing system), through the recuperative use of energy in general, and by time- shifting the generation and consumption of power to the most effective and/or efficient time-frames
  • FSLS Fuzzy Sighted Link State.
  • a family of wireless routing algorithms e g , for a wireless mesh that depend on the observation that changes in links that are far away ( ⁇ e , relative to the mesh) are less
  • a MASTER assigns virtualized computing, storage, and communications resources to a set of SLAVEs over which it has authority, and a HANDLER implements the physical interfaces of these resources
  • the HANDLER interfaces and logic are implemented within the SLAVE PUMP dev ⁇ ce(s) to which the SLAVE processors are attached, such that the HANDLER hardware provides functionality similar to a software-based isolation kernel
  • the HANDLER'S hardware implementation supports dedicated per-process registers and FIFO devices that enable user- space input/output without system call overhead, within the security constraints set by the MASTER See also MASTER, PUMP, SELF, SERVANT, SLAVE
  • Heat Flux The flow rate of heat across or through a material, or the quantity of thermal energy transferred to a unit area per unit time, typically given in units of W/cm 2 See also Critical Heat Flux, RUBE
  • HMAC Hashed Message Authentication Code. (Sometimes just MAC, although this can be confusing) A one-way hash computed from a message and some secret data, for the purpose of detecting whether a message has been altered It is difficult to forge without knowing the secret data See RFC 2402 See also MAC
  • HSLS Hazy-Sighted Link State.
  • a routing algorithm invented by researchers at BBN Technologies) in the family of wireless routing algorithms called FSLS Its designers sought to minimize global network waste, the total overhead of which they defined as, "the amount of bandwidth used in excess of the minimum amount of bandwidth required to forward packets over the shortest distance (in number of hops) by assuming that the nodes had instantaneous full-topology information "
  • the network overhead associated with HSLS is theoretically optimal, utilizing both proactive and reactive link-state routing to limit network updates in space and time, and on larger networks HSLS begins to exceed the efficiencies of the best-known other routing algorithms Unlike traditional methods, HSLS does not flood the network with link-state information to attempt to cope with moving nodes that change connections with the rest of the network, nor does it require each node to have the same view of the network
  • a variant of HSLS may be used generally within any subsystems where distributed resource information is relevant to the distribution of resource flows (e g , information,
  • I2P Heating, Ventilation, & Air Conditioning.
  • I2P "Garlic Router”. An open-source, anonymizing overlay network based on establishing secure, multi- hop connections among intentionally selected I2P nodes
  • I2P incorporates lessons learned from TOR, an alternative anonymizing network that predates it
  • I2P is fundamentally a packet switched network
  • TOR is fundamentally a circuit switched one, allowing I2P to transparently route around congestion or other network failures, operate redundant pathways, and load-balance the data across available resources TOR and I2P complement each other in their focus - TOR works towards offering high speed anonymous Internet outproxying
  • I2P works towards offering a decentralized, resilient, low-latency network in itself
  • One goal of I2P is to achieve appropriateness for use in hostile regimes against state-level adversaries
  • I2P uses a technique called "garlic routing" - layered encryption of messages, passing through routers selected by the original sender I2P sends messages by taking a message, encrypt
  • IGMP Internet Group Management Protocol. A standard protocol for managing multicast groups
  • Integrity The quality of an information system reflecting the logical correctness and reliability of the operating system, the logical completeness of the hardware and software implementing the protection mechanisms, and the consistency of the data structures and occurrence of the stored data In a formal security mode, integrity is interpreted more narrowly to mean protection against unauthorized modification or destruction of information
  • IP Internet Protocol. See RFC 791 and RFC 2460 See also TCP, UDP
  • ISP Internet Service Provider.
  • LDAP Lightweight Directory Access Protocol. See also DELEGATE
  • MAC Message Authentication Code. Used to validate information transmitted between two parties that share a secret key Also Media Access Control, the globally unique hardware address of an Ethernet
  • MASTERTM Multiprocessor Adaptive Scheduler & Task Executor/Redirector.
  • a MASTER may have a number of dedicated, trusted, attached (and therefore local) SLAVE resources over which it enjoys complete control, via a HANDLER, and any number of "volunteer" SERVANT resources that are not trusted See also BOSS, HANDLER, SELF, SERVANT, SLAVE
  • MARSHALTM Multi-Agent Routing, Synchronization, Handling, & Aggregation Layer.
  • a distinguished SERVANT node having the responsibilities of fulfilling a MARSHAL role Any node, authenticated as having a MARSHAL role, that serves as a gateway for system users to access SHADOWS services via a network (e g , the Internet)
  • a MARSHAL may also communicate with other MARSHALS, under the auspices and control of a MASTER-led team, in order to implement one or more overlay networks and/or network fabrics whose purposes and characteristics are determined by the MASTER-led team (but are opaque to the MARSHALS)
  • a MARSHAL is not trusted, and the role is typically fulfilled by a SERVANT node (which is also inherently untrusted)
  • a SLAVE emulating a MARSHAL
  • SLAVE emulating a MARSHAL
  • MBps Mega-Bytes per second. A measure that often refers to a storage device throughput rate, in millions of bytes per second, where 1 byte equals 8 bits
  • a storage device may have a peak rate that is constrained by its interface, and this rate is normally achieved only for short bursts, when the associated read or write request can be satisfied via the device's cache memory
  • a storage device also has a sustained rate that corresponds to the maximum rate at which the device can continuously read or write data, and this rate is tied to the accessibility of the underlying storage media ( ⁇ e , the media rate) See also Mbps
  • erasures are symbols missing from known locations ( ⁇ e , the symbols are not known, but their position is) If, instead of e erasures, there are up to f faulty symbols, but their positions are unknown, then a system that can correct up to e erasures can correct at most f
  • Memoize To modify a function such that re-computation of previously computed results is avoided in favor of retrieving and substituting the previously computed results themselves Memoization essentially augments a computational function with a cache of previously computed results, indexed by the arguments of ( ⁇ e , inputs to) the previous computations Memoization, since it is based on caching, therefore trades space for time Memoization is only appropriate for pure functions (one with no side effects, whose return value depends only on the values of its arguments) Memoization is useful in all kinds of situations, including almost any kind of searching (cache the results of the search so that you can skip it next time), HTML generation (cache the results of the generation process so that you don't have to generate the page next time), and numeric computation (cache the results of the computation) The word "memoize” was coined by Donald Michie in 1968 See also Cache, Caching, FACTUAL, Memoization
  • Memoized Function A function that remembers which arguments it has been called with and the result returned and, if called with the same arguments again, returns the result from its memory rather than recalculating it
  • a memoized function ( ⁇ e , one with caching) may run faster than one without caching, but it uses up more memory This same principle is found at the hardware level in computer architectures which use a cache to store recently accessed memory locations See also Cache, Caching, FACTUAL, Memoize
  • MRAM Magnetic RAM.
  • NVRAM Magnetic RAM.
  • BB-RAM NVRAM
  • RAM Random Access Memory
  • MTBF Mean Time Before Failure. Older but synonymous term for MTTF See also MTTF, MTTR
  • MTTF Mean Time To Failure.
  • the MTTF is estimated as the number of power-on hours per year (usually assumed at 100% power on) divided by the AFR
  • a server-class disk drive with a manufacturer-specified AFR of 0 63% would have an estimated MTTF of about 1 4 million hours
  • PC-class disk drives typically have much lower AFR values, which are also calculated with a much lower number of power-on hours per year Note, however, that even for server-class drives, observed AFR values in the field exceed 1 %, with 2% to 4% common, and up to 12% observed in some systems, so the estimated MTTF needs to be carefully considered See also AFR, Availability, MTBF, MTTR
  • MTTR Mean Time To Repair. The average time (usually determined through empirical measurement) required to restore service after a breakdown or loss See also Availability, MTTF
  • NaN NaN. Not a Number.
  • NaN is not the same as infinity, although both are typically handled as special cases in floating-point representations of real numbers as well as in floatingpoint operations
  • An invalid operation is also not the same as an arithmetic overflow (which might return an infinity) or an arithmetic underflow (which would return the smallest normal number, a denormal number, or zero)
  • a NaN does not compare equal to any floating-point number or NaN, even if the latter has an identical representation
  • One can therefore test whether a variable has a NaN value by comparing it to itself ( ⁇ e if x ⁇ x then x is NaN)
  • IEEE floating-point standard IEEE floating-point standard (IEEE 754),
  • NEARTM Nearline Emulation & Archival Repository. Used by CHARM for nearline storage, NEAR is the technology underlying a NEARbladeTM or NEARd ⁇ veTM It provides high-capacity, electronically assisted long-term data storage that is subject to minimal mechanical risk (including wear, vibration, and seismic events), due to significantly reduced mechanical duty cycle
  • the NEAR technology attempts to minimize the number of spinning disk drives while providing full accessibility to data Prior to spin-down, the NEAR technology performs extensive analysis and maintenance, after which it may reconfigure the system as necessary in accordance with the analysis and maintenance results
  • Data stored in NEAR is safe from intruders even if stolen As a fringe benefit of the NEAR storage approach, the number of read and/or accesses per second is orders of magnitude faster than unassisted hard disk drives See also CHARM, FIRE, SMART
  • NVRAM Non-Volatile RAM Contrast with DRAM and SRAM, which are volatile BB-RAM and MRAM are types of NVRAM See also BB-RAM, DRAM, MRAM, RAM, SRAM
  • Object An entity that contains or receives information and upon which subjects perform operations
  • Packet An ordered group of data and control signals transmitted through a network as a subset of a larger message
  • Packet Switching A communications paradigm in which packets (messages or fragments of messages) are individually routed between nodes, with no previously established communication path Packets are routed to their destination through the most expedient route (as determined by some routing algorithm) Not all packets traveling between the same two hosts, even those from a single message, necessarily follow the same route Page Fault.
  • Page Fault The condition that occurs in a virtual memory system when there is an attempt to access a virtual memory page that is not currently present in physical memory See also Demand Paging, Swapping
  • PERKSTM Peak Energy Reserve, Kilowatt-Scale.
  • a peak-shaving system that directly captures excess or low-cost electrical energy from a multiplicity of sources (when it is cheapest or most readily available) and stores it for later reuse, such as during peak periods (when power is most expensive or less available)
  • UPS which remains charged “just in case”
  • the PERKS capability continually captures and discharges stored energy "just in time " as needed, so as to reduce the overall energy cost and maximize full-processing availability
  • UPS Low Energy Reserve
  • POL Point-of-Load.
  • Point-of-load (POL) DC-DC converters enable electronic developers to overcome the challenges caused by the high peak current demands and low noise margins of high-performance semiconductor devices, by placing individual, non-isolated, DC power sources near their point of use, thereby minimizing losses caused by voltage drops and ensuring tight voltage regulation under dynamic load conditions
  • POL devices also reduce noise sensitivity and EMI emissions by significantly shortening potential radiators and RF-susceptible conductors
  • Policy-based Management A method of managing system behavior or resources by setting "policies” (often in the form of "if-then” rules) that the system interprets
  • Protection Profile An implementation-independent set of security requirements and objectives for a category of products or systems which meet similar consumer needs for IT security
  • a PP is intended to be reusable and to define requirements which are known to be useful and effective in meeting the identified objectives
  • a reusable set of either functional or assurance components e g , an EAL
  • Information about Protection Profiles can be found on the Internet at http //www iatf net See also CC
  • Priority Loads In power plants where load management schemes are used, a priority is assigned to each load center Loads with the highest priority are powered first and shed last
  • PRNG Pseudo-Random Number Generator.
  • a mechanism for generating pseudo-random numbers on a computer They're called pseudo-random, because you can't get truly random numbers from a completely non-random thing like a computer
  • a pseudo-random number generator is a computational or physical device designed to generate a sequence of numbers that does not have any easily discernable pattern, so that the sequence can be treated as being random In reality, however, if a computer generates the number, another computer can reproduce the process Random number generators have existed since ancient times, in the form of dice and coin flipping, the shuffling of playing cards, the use of yarrow stalks in the I Ching, and many other methods See also Pseudo-Random Number, Random Number Generator
  • Protocol A formal set of conventions governing the formatting and relative timing of message exchange between two or more communicating systems or devices
  • Proxy A software agent, often a firewall mechanism, that performs a function or operation on behalf of another application or system while hiding the details involved See also FLAMERouter, MARSHAL, RUSHrouter
  • Proxy Server A firewall component that manages Internet traffic to and from a LAN and that can provide other features, such as document caching and access control
  • a proxy server can improve performance by supplying frequently requested data, such as a popular Web page, and it can filter and discard requests that the owner does not consider appropriate, such as requests for unauthorized access to proprietary files
  • Pseudo-Random Number One of a sequence of numbers generated by some algorithm so as to have an even distribution over some range of values and minimal correlation between successive values See also PRNG
  • PUMPTM Parallel Universal Memory Processor. See also HANDLER, MASTER, SLAVE
  • PV Photo-Voltaic
  • QoS Quality of Service
  • QoS Quality of Service
  • QoS Quality of Service
  • a group of service classes that define the performance of a given circuit
  • RAM Random Access Memory
  • NVRAM which is specifically non-volatile
  • RAM loses its content when power is turned off, but not so much that its cannot be reconstructed by a sophisticated (e g , state-funded) adversary (the longer that a particular memory bit maintains its value, the more recoverable it is by an adversary that gains physical access, so powering off, or making a few passes of rewriting random data really has little effect)
  • a sophisticated (e g , state-funded) adversary the longer that a particular memory bit maintains its value, the more recoverable it is by an adversary that gains physical access, so powering off, or making a few passes of rewriting random data really has little effect
  • the PUMP'S ability to maintain complementary states in memory - a technique where memory locations are invisibly toggled to and from their complementary states such that each state has a duty cycle of approximately 50% (which means that an adversary gaining physical access cannot determine previous contents after a power-off) See also BB-RAM, CHARM, D
  • RECAPTM Reliably Efficient Computation, Adaptation, & Persistence.
  • the RECAP protocol is never used in the clear, even locally, and it is assumed to be subject to Byzantine failures RECAP may be safely tunneled via other protocols, especially RUSH, but such tunneling is performed only by a specially authorized device called a FLAMERouter (described elsewhere), which also contains a MASTER RECAP is used for communication among what are "hoped" to be trusted parties, in contrast to UNCAP See also FLAMERouter, RUSH, Tunneling Router, UNCAP
  • RLE Run-Length Encoding.
  • runs of data that is, sequences in which the same data value occurs in many consecutive data elements
  • RLE-1 back-to-back encoding sequence
  • RLE-2 back-to-back encoding sequence
  • RLE-3 back-to-back encoding sequence
  • RNG Random Number Generator.
  • a random number generator is a computational or physical device designed to generate a sequence of numbers that does not have a pattern In theory, true random numbers only come from truly random sources, such as atmospheric noise and radioactive decay
  • recuperated energy heats and expands the working fluid (causing a phase-change to vapor if the temperature is sufficiently high), which, in conjunction with optional vapor injection, creates a motive force that helps to circulate the working fluid among system components (in order to thermally stabilize the system, to further extract re-usable energy for immediate reuse or storage, and to efficiently exhaust waste energy without overly subcooling the working fluid)
  • a small, continuous, positively pressurized liquid flow is maintained, ensured via a low-power pump means, in order
  • the RUBE Double-Boiler apparatus is part of a closed-loop system, that, in a preferred embodiment is connected to other components as shown in the figure, "RUBE - Heat Energy Recuperation Cycle Overview "
  • the RUBE Double-Boiler apparatus comprises an "inner boiler” and an “outer boiler,” such that the former is fully enclosed within the latter, in order to maximize the recuperation of heat energy (thermal energy) dissipated by the aggregation of enclosed heat sources, and optionally, to separate the recuperated heat energy into two or more "grades" according to desired or observed temperatures
  • the "hot" heat sources ⁇ e , those components with a relatively higher heat flux, such as CPUs
  • the "warm” heat sources ⁇ e , those components with a relatively lower heat flux, such as flash memory chips
  • Both the inner and outer boilers are pressure vessels intended to withstand a maximum of 7-
  • the RUBE Double-Boiler apparatus has an outer shell of cast aluminum (although other construction methods and materials are possible), and its external shape and form factor is such that it can mate with guide channels extruded into a vertically oriented cylindrical or partly cylindrical aluminum extrusion designed to contain a multiplicity of RUBE Double-Boiler units Given the aforementioned vertically oriented extrusion, the intent is to be able to easily align and slide the RUBE Double-Boiler apparatus from the extrusion upper opening, downward into the extrusion until it reaches a bulkhead, where couplings and connectors on the bottom of the Double-Boiler apparatus mate with complementary couplings and connectors within the extrusion
  • the RUBE Double-Boiler apparatus is a pressure-sealed, field- replaceable unit having blind-mating, quick-disconnect inlet and outlet couplings with double EPDM * seals, and capable of operating at 100 PSI, such as those available from Colder
  • the inner boiler apparatus is colocated with the "hot" surfaces (the surfaces with the largest heat flux) of the "hottest" of the heat-producing devices, which are so arranged that such placement is possible with a minimum (or otherwise convenient) number of manifolds*
  • the inner boiler apparatus is oriented vertically (although it is depicted horizontally here, for convenience) such that the liquid inlet O and vapor outlet ⁇ are at the top, and the liquid outlet ⁇ is at the bottom
  • the working fluid expands substantially when heated Since the inlet is check-valved, this expansion greatly pressurizes the heat exchanger and the working fluid is expelled through the outlet check-valve (where it makes its way to vapor outlet ⁇ and/or liquid outlet ⁇ ), thereby creating a partial vacuum within the heat exchanger ⁇ under discussion (which helps to pull in more
  • EPDM is prefeired for its compatibility w ith the pieferred working fluid
  • the RUBE Vapor Injector is a means to 1) maintain a load (the "boiler") within a desired temperature range, and 2) recuperate as much energy as possible from the heat dissipated by the load, in order to convert the recuperated heat energy into mechanical energy (specifically, pressure energy) that can be used as motive force to reduce or eliminate the energy that would otherwise be needed for circulation pumps in a phase-change heating, cooling, and/or power generation system
  • the working fluid may be an organic dielectric fluid with a boiling point between 20 0 C and 40°C, such as 1 -methoxy-heptafluoropropane (C 3 F 7 OCH 3 )
  • Other working fluids may also be suitable, some examples of which are listed in section 10.3.
  • the working fluid expands substantially when heated See also RUBE, RUBE Double-Boil
  • RUSHTM Rapid Universal Secure Handling.
  • a multi-level proprietary communications protocol that has both asynchronous and synchronous characteristics and can stand alone or be tunneled over existing WAN protocols (whether synchronous or asynchronous) RUSH is used as the primary carrier protocol among FLAMERouters, MARSHALS, and client-side RUSHrouter software or hardware RUSH can directly incorporate flows from the RECAP and UNCAP protocols, and also tunnels them, along with various industry-standard protocols
  • the RUSH protocol can take advantage of other protocols (e g , I2P, TOR) as necessary to prevent (or reduce the threat of) traffic analysis, and can also tunnel other protocols, for the same reasons
  • a key characteristic of RUSH is its propensity for simultaneously utilizing multiple network channels, interfaces, gateways, routes, etc , such that a single conceptual source and destination pair effectively becomes multiple targets and destinations that tend not to be apparently related unless an adversary has truly global visibility (in which case, such an adversary still faces a multiplicity of overwhelming cryptographic and traffic analysis challenges
  • each outbound channel interface (e g , a physical network interface, wireless adapter, etc ) has a dedicated RUSHrouter operating in its own VM, a separate RUSHrouter, also in its own VM, serves as the default gateway for the host computer, interfacing any hosted applications to the SHADOWS infrastructure by appropriately routing communications through the RUSHrouters that control the channel interfaces See also DELEGATE, FLAMER
  • SAS Serial Attached SCSI A disk drive interface standard that supersedes parallel SCSI and can accept either SAS or SATA disk drives See also SATA
  • SATA Serial ATA A disk drive interface standard that supersedes parallel ATA SATA disk drives be used with either SAS or SATA disk host adapters, but a SATA host adapter can communicate only with SATA drives See also SAS
  • SCADA Supervisory Control And Data Acquisition A category of mechanisms for process control that includes hardware and software components SCADA provides for the collection of data in real time from sensors and machines in order to control equipment and conditions, and typically includes transmitting the data to one or more central locations for logging and/or analysis
  • SCRAMTM Survivable Computing, Routing, & Associative Memory.
  • An individual SCRAM machine is intended to be self-contained and capable of operating on or off the electrical grid for extended durations, and without human attention or maintenance
  • a SCRAM machine is its own miniature datacenter that can be located in out-of-the way places such as underground, on a pole or roof, etc , as easily as in an office, warehouse, or datacenter See also CHARM, CORE, FRAME, SCRAMnet, SELF, SUREFIRE
  • SCRAMnetTM A SCRAM-based network comprising any number of geographically proximate MASTERS, SLAVES, and SERVANTS.
  • SCRAMnets are the basis of the SHADOWS infrastructure, but must always operate under the auspices of a distributed team that includes multiple MASTERS
  • Each SHADOWS node is a SCRAMnet, but not necessarily vice-versa
  • a SCRAMnet must meet specific requirements to become a SHADOWS node
  • Geographically proximate SERVANTS can organize themselves into SCRAMnets without having a local MASTER, but only for the purpose of establishing communication with the SHADOWS network, at which point they may be assigned to a MultiMASTER team (which always has multiple MASTERS, by definition) SERVANTS must be able to communicate with the SHADOWS network, either individually or collectively, and they may cooperate extensively to do so Any "Candidate MASTER" that is unable to establish itself as a MASTER ( ⁇ e , a full peer with other MASTERs) retain
  • ECC-DED Single Error Correction, Double-Error Detection A form of ECC that can correct a single memory error or SEU, and detect two See also DRAM, ECC, SEU
  • SELFTM Secure Emergent Learning of Friends.
  • An automated identity- and role-oriented "immune system” that differentiates “self and “non-self, “friend” and “foe” - 1 e , between authorized and unauthorized objects, subjects, and interactions
  • the focus of SELF is on the recognition of a relatively small set of correct behaviors rather than the recognition of any of an infinitely large set of counterfeit behaviors (by definition, all non-self behavior is assumed malicious)
  • SELF is the basis for establishing and maintaining trust among the interdependent systems, subsystems, and components in a SHADOWS infrastructure
  • SELF includes novel Byzantine agreement logic in its decision-making process SELF is highly integrated with BOSS (which is the definitive authority on trust and correctness), and with the RECAP, UNCAP, and RUSH protocols Any anomalous, "non-self behavior activates an appropriate immune system response See also BOSS, MASTER, RECAP, RUSH, SCRAMnet, SERVANT, UNCAP
  • Sensitive Information Information that, as determined by a competent authority, must be protected because its unauthorized disclosure, alteration, loss, or destruction can at least cause perceivable damage to someone or something (DoD 5200 28-STD) See also SBU, Sensitivity Label
  • Sensitivity Label A piece of information that represents the security level of an object and that describes the sensitivity (e g , classification) of the data in the object Sensitivity labels are used by the TCB as the basis for mandatory access control decisions (DoD 5200 28-STD)
  • DoD 5200 28-STD mandatory access control decisions
  • an object's sensitivity label (and other security properties) is available to CHARM (the SHADOWS associative memory system), and therefore to BOSS (the SHADOWS TCB), without having to retrieve the object itself, via its FASTpage index entries See also BOSS, CHARM, FASTpage, Sensitive Information, TCB
  • Set Associative Cache A compromise between a direct mapped cache and a fully associative cache where each address is mapped to a certain set of cache locations The address space is divided into blocks of 2 m bytes (the cache line size), discarding the bottom m address bits
  • An "n-way set associative" cache with S sets has n cache locations in each set Block Jb is mapped to set "b mod S" and may be stored in any of the n locations in that set with its upper address bits as a tag
  • set "b mod S” is searched associatively for the tag
  • a direct mapped cache could be described as "one-way set associative" ( ⁇ e , one location in each set), whereas a fully associative cache is W-way associative (where N is the total number of blocks in the cache) Performance studies have shown that is is generally more effective to increase number of entries rather than associativity and that 2- to 16-way set associative caches perform almost as well as fully associative caches at little
  • SEU Single Event Upset.
  • the primary goal of ECC mechanisms is to detect and/or correct the inevitable occurrence of one or more SEUs
  • Consumer computers rarely have ECC at all, but server computers often protect their main memory systems with SEC-DED ECC (which is capable of correcting a single error per access), and sometimes have a "ChipkiH" type of ECC that can detect a single chip failure and some multiple SEU combinations
  • SEUs are probabilistic, however, as memory capacities and densities increase, and as average chip temperature increase, the likelihood of SEUs increases even more quickly than one might expect SEU likelihood has now increased to the point that failure due to uncorrectable SEU is becoming a relatively common event, even when SEC-DED ECC is used See also CHARM, DRAM, ECC, FEC, SEC-DED
  • a SHADOWS network consists of a combination of terrestrial and space-based SHADOWS nodes and singleton SCRAM machines (described later, along with SERVANTS and MARSHALS)
  • a geographically proximate collection of SCRAM machines may self-organize into a geographically proximate SCRAMnet comprising a SHADOWS node SCRAM machines that are unable to join a SHADOWS node remain as singletons until they can join one, if ever Singletons act as SERVANTS to bona-fide SHADOWS nodes, and as MARSHALS between SHADOWS nodes and system users (however, non-singletons also volunteer for these roles on a part-time basis)
  • SIP Session Initiation Protocol. See also DELEGATE SLA. Service Level Agreement
  • SLAVETM Storage-Less Adaptive Virtual Environment.
  • SMART Self Monitoring Analysis & Reporting Technology. Also S.M.A.R.T.
  • a monitoring system and signaling interface for magnetic disk drives to detect and report on various indicators of reliability SMART enables a host processor to receive analytical information from the disk drive that may be useful for anticipating failures See also NEAR
  • a system using a relatively low-temperature phase-change working fluid to receive heat energy from the sun for immediate use acts as a "boiler" or subsequent use, and especially for the primary purpose of generating electricity
  • a system using a relatively low-vapor-pressure working fluid for example, an appropriate Paratherm thermal oil
  • the heat energy in this context refers to energy that can be immediately used immediately (or stored for later use) to effect or help effect a liquid/vapor phase-change, such as occurs, by design, in a "boiler " Received energy heats and expands the phase-change working fluid (which may have been preheated via RUBE, above), and which, in conjunction with optional vapor injection (see RUBE Vapor Injector, described elsewhere) in the "boiler" feed circuit, and in conjunction with a FORCE nanoturbine or FPSE (Free
  • Static RAM A type of volatile RAM whose cells do not need to be continually refreshed, but which may lose data if power is removed Contrast with DRAM and NVRAM See also DRAM, NVRAM, RAM
  • SSD Solid-State Disk
  • BB-RAM battery-backed RAM
  • flash memory flash memory
  • SUREFIRETM Survivable Unmanned Renewably Energized Facility & Independent Reconfigurable Environment.
  • SUREFIRE sites can be located on virtually any outdoors property, but also in basements or on rooftops, etc
  • SUREFIRE sites usually include one or more renewable energy systems, in addition to conventional energy sources SUREFIRE sites are designed for maximal energy efficiency, and emit very little waste heat All SUREFIRE sites may be expendable without data loss, and penetration can never yield useful information to an attacker See also SCRAM
  • Swap File A special file in a virtual memory system which is used to temporarily store "dirty" memory pages Swap files, although typically disk-based, are often organized for relatively rapid access compared to writing dirty pages back to their original location See also Demand Paging, Swapping
  • Swapping In a virtual memory system, a technique to remove virtual pages from physical memory in order to replace them with others that are currently needed "Dirty" pages (those which came from an executable image or data file and have been modified but not yet written back) are written to a "swap file" temporarily (unless they've been written previously and are unchanged, in which case they can simply be deleted) Non- dirty pages can simply be deleted, since they can be reread on demand Pages are swapped out only if the data in them cannot be retrieved another way See also Swap File
  • TCB Trusted Computing Base.
  • the TCB is a useful concept because it identifies, within a system, the subsystem which owns the security (in the SHADOWS infrastructure, BOSS implements the TCB) The rest of the components may communicate with this TCB and rely on it to make correct security decisions Thus, the TCB must exist and it must make 100% of the security decisions
  • the DoD defines the TCB as the totality of protection mechanisms within a computer system - including hardware, firmware, and software - the combination of which is responsible for enforcing a security policy
  • a TCB consists of one or more components that together enforce a unified security policy over a product or system
  • the ability of a trusted computing base to correctly enforce a security policy depends solely on the mechanisms within the TCB and on the correct input by system administrative personnel of parameters (e g , a user's clearance) related to the security policy (DoD 5200 28-STD)
  • TCSEC1983 defines the TCB as "
  • TCP Transmission Control Protocol.
  • TCP/IP IP over IP
  • TCS Trusted Computer System. A system that employs sufficient hardware and software integrity measures to allow its use for processing simultaneously a range of sensitive or classified information (DoD 5200 28-STD) TCSEC. Trusted Computer System Evaluation Criteria.
  • TDP Thermal Design Power.
  • T C ase For power-hungry integrated circuit chips, there is sometimes an observable or even specified relationship between T C ase and TDP See also T ca se
  • TOR The Onion Router.
  • Any SHADOWS nodes that implements the RUSH protocol can participate in the TOR network as a TOR node
  • SHADOWS does not depend on TOR, participating in the TOR network provides a source of mix-in traffic that helps to prevent traffic analysis by a sophisticated attacker, while also helping the TOR network See also I2P, RUSH
  • Trap Door A hidden software or hardware mechanism that permits system protection mechanisms to be circumvented It is activated in some non-apparent manner (e g , special "random" key sequence at a terminal) (DoD 5200 28-STD)
  • Trojan Horse A computer program with an apparently or actually useful function that contains additional (hidden) functions that surreptitiously exploit the legitimate authorizations of the invoking process to the detriment of security For example, making a "blind copy" of a sensitive file for the creator of the Trojan Horse (DoD 5200 28-STD)
  • Trusted A Trusted system or component is one whose failure can break the security policy See also Trustworthy
  • Trusted Path A mechanism by which a person at a terminal can communicate directly with the TCB This mechanism can only be activated by the person or the TCB and cannot be imitated by untrusted software (DoD 5200 28-STD) See also TCB, Trusted Software
  • Tunneling Refers to the encapsulation of protocol A within protocol B, such that A treats B as though it were a data link layer See also Tunneling Router
  • Tunneling Router Router or system capable of routing traffic by encrypting it and encapsulating it for transmission across an untrusted network, for eventual de-encapsulation and decryption
  • the FLAMERouter and RUSHrouter are both tunneling routers See also FLAMERouter, RUSHrouter, Tunneling
  • UDP User Datagram Protocol.
  • TCP Transmission Control Protocol
  • IP Internet Protocol
  • UNCAPTM Untrusted Node Computation, Adaptation, & Persistence.
  • the secure, proprietary protocol used for communication between MASTER-led teams and the SERVANTS ( ⁇ e , untrusted nodes) that "belong" to them UNCAP appears to be used for RUSHrouter-to-RUSHrouter communication also, but this is only coincidental, since every RUSHrouter comprises at least one SERVANT UNCAP is always tunneled via the RUSH protocol, but unlike RECAP, there is no expectation of trustworthiness among its participants See also FLAMERouter, RECAP, RUSH, RUSHrouter, Tunneling Router
  • the usability of a system involves three potentially conflicting factors how quickly users can do what they want to do, how correctly they can do it, and how much they enjoy doing it
  • the underlying design of a computer system can affect its usability Designing usability into a system involves analyzing users' needs, and then designing around those needs while optimizing the three factors
  • USB Universal Serial Bus.
  • High-speed USB 2 0 allows data transfer up to 480 Mbps, which is 40 times faster than full-speed USB Due to signaling overhead, the USB 2 0 standard appears to have a throughput limitation of around 25 to 30 MBps, or is about half of what is implied by the raw data rate
  • WAN Wide Area Network
  • WLAN Wireless LAN.
  • VMM Virtual Machine Monitor. Equivalent to hypervisor responsible for supervising virtual machines In SHADOWS, the VMM is part of the BOSS role
  • VLAN Virtual LAN
  • VPN Virtual Private Network
  • a primary objective of the SHADOWS infrastructure is to establish a highly survivable, essentially maintenance-free shared platform for extremely high-performance computing ( ⁇ e , supercomputing) - with "high performance” define both in terms of total throughput, but also in terms of very low-latency (although not every problem or customer necessarily requires very low latency) - while achieving unprecedented levels of affordability (both capital and operational expense) - that is capable of earning a deserved reputation for trustworthiness, survivability, and fault tolerance
  • the idea is to use distributed "teams” of nodes in a self-healing network as the basis for managing and coordinating both the work to be accomplished and the resources available to do the work
  • the SHADOWS concept of "teams” is responsible for its ability to "self-heal” and “adapt” its distributed resources in an “organic” manner
  • the "teams” themselves are at the heart of decision-making, processing, and storage in the SHADOWS infrastructure Everything that's important is handled under the auspices and stewardship of a team
  • node is the smallest addressable unit of intelligent storage, or working storage A node comprises at least one processor (to do computational work), along with some mix of volatile and non-volatile memory (to provide information storage)
  • SHADOWS nodes can be organized into "machines,” and machines can be organized into “sites” - and these are the basis of the two primary conceptual SHADOWS building blocks
  • the subject machines are SCRAM machines
  • the subject sites are SUREFIRE sites
  • SCRAM machines are miserly in their energy usage and are self-contained (including computing, networking, persistent storage, power generation, etc ) SCRAM machines do not need computer-friendly environments (you could safely drop one into a lake without damaging it), so they can easily be distributed to multiple sites, which need not be data centers (any physically secure location may be appropriate) SCRAM machines are essentially very small, self-contained datacenters, except that they require external power and, to some degree (depending on the threat profile), physical protection
  • SCRAM machines are designed for deployment to unmanned/unattended locations (e g , underground) and require no routine maintenance
  • a SHADOWS "site” is defined as a group of SHADOWS machines (whether or not they are SCRAM machines) that share the same or approximately the same GPS coordinates (within some radius and/or margin of error) and are interconnected with a multiplicity of switching and/or routing communications fabrics
  • one or more SCRAM machines would be co-located at a site - in a particular kind of highly survivable facility referred to as a SUREFIRE site
  • the SUREFIRE Freestanding Vault is a preferred embodiment, and by design its minimal configuration would enjoy the lowest cost of the four example unmanned configurations if deployed in volume, which would enable affordable, widespread deployment
  • the packaging of all of its major components makes it equally at home in a datacenter, office building, warehouse, basement, or on the roof Even though it contains its own multifuel power plant, it requires less than 50 square feet of floor space, including room for maintenance access
  • the SUREFIRE Freestanding Vault can be configured to support various levels of performance in the sub-TFLOPS to 20 TFLOPS range, per vault
  • the SUREFIRE Mini-Silo is a preferred embodiment, and by design its minimal configuration would enjoy the lowest cost of the three example underground configurations if deployed in volume, which would enable affordable, widespread deployment
  • the packaging of all of its major components is tailored especially to a silo configuration (a cylindrical shape approximately 3 feet in diameter)
  • the SUREFIRE Mini-Silo can be configured to support various levels of performance in the sub-TFLOPS to 10 TFLOPS range, per silo
  • the SUREFIRE Single-Level Underground Vault is an alternate embodiment - a larger diameter silo - that could be affordably produced in fairly low quantities (relative to the SUREFIRE Mini-Silo), and is able to accommodate a higher degree of conventional equipment than the SUREFIRE Mini-Silo
  • the SUREFIRE Single-Level Underground Vault is especially well-suited to supercomputing accompanied by significant radio communications (the silo itself serves as the base for relatively lightweight communications towers)
  • the SUREFIRE Single-Level Underground Vault can be configured to support various levels of performance in the 0 5 TFLOPS to 10 TFLOPS range, per silo
  • the SUREFIRE Multi-Level Underground Vault is an alternate embodiment - also in a silo configuration - that is likely to require a somewhat substantial level of site engineering and preparation prior to deployment A typical deployment scenario would be underneath (literally) a commercial-class wind turbine (100 KW or more) While the basic design is straightforward to replicate, its site preparation unlikely to be, due to the facility depth and likely permitting issues
  • the SUREFIRE Multi-Level Underground Vault can be configured to support various levels of performance in the 2 TFLOPS to 50 TFLOPS range, per silo
  • a SHADOWS "mesh” (which may also be a “neighborhood” and/or “community”) is a group of SHADOWS sites in the same locale, sharing proximate GPS coordinates and interconnected with a meshed network of point-to-point and point-to-multipoint links, augmented by WAN links (in a preferred embodiment, a diverse multiplicity of terrestrial and satellite channels are used to achieve specific survivability goals)
  • a SHADOWS "region” is typically (but this may be defined by policy) the collection of WAN-connected (at least) SHADOWS sites supplied (or potentially supplied) by the same utility power grid (thus, in the U S , for example, there are four regions under this definition, but other definitions are possible also) Adjacent regions may also enjoy mesh-like point-to-point or point-to-multipoint interconnections, which may have the effect of collapsing two or more physical (or policy-defined) regions into a single logical region
  • a SHADOWS "theater” is a collection of WAN-connected sites which, for our purposes, is essentially distinguished by some combination of geographical, political, military, legal, and technical considerations that force special or self-similar treatment throughout the collection Examples of theaters are North America, Western Europe, China, Japan, Australia, the stratosphere, the troposphere, LEO satellites, MEO satellites, the moon, and Mars (this is clearly a non-exhaustive list)
  • SHADOWS "universe" is the total collection of SHADOWS theaters, whether interconnected by any means whatsoever, or even disconnected
  • SHADOWS considers its entire network ( ⁇ e , the SHADOWS universe) as the basis for optimization
  • SHADOWS considers both the sending and receiving links for every SHADOWS node along a path
  • SHADOWS drivers include the current and probable future availability of resources, and the maintenance of adequate reserves to ensure appropriate levels of survivability
  • the Firewall, Link-Aggregator-Multiplexer, & Edge Router (FLAMERouter) capability lives at the interface between a SHADOWS supercomputing node and all external network connections (LAN and WAN)
  • One of its primary responsibilities is to cooperate with the FLAMERouters of other SHADOWS nodes in order to transparently and logically connect each SHADOWS node to the others, optimally, using the Scrutiny RECAP (Reliably Efficient Computation, Adaptation, & Persistence) protocol over any and all channels available (private and public)
  • Scrutiny RECAP Reliably Efficient Computation, Adaptation, & Persistence
  • a key goal is to handle traffic as though all the nodes were connected on an amalgam of VLANs and VPNs (but without the VLANs and VPNs), taking extraordinary measures as necessary to avoid partitioning of the "virtual network "
  • SHADOWS FLAMERouters Another key role of the SHADOWS FLAMERouters is to safeguard the SHADOWS communications channels, not only to prevent denial of service (which includes resisting DDOS attacks), but also to prevent traffic analysis, so as to render the SHADOWS network opaque FLAMERouters use active techniques, in conjunction with the SELF subsystem, to classify both inbound and outbound traffic as friendly, benign, or malicious Friendly traffic (as determined by SELF) is granted the highest priorities Benign and malicious traffic are both allowed, depending on the properties of the traffic itself, but are closely managed by the FLAMERouters so as to meet the specific needs of SHADOWS (non-self traffic is desirable for mixing purposes, as part of defending against traffic analysis by attackers, but must be limited to exactly the desired bandwidths, while ensuring that no malicious traffic is allowed to propagate)
  • the FLAMERouter processes can be implemented in software and/or hardware, but in a preferred embodiment are implemented primarily in reconfigurable hardware, under the auspices of dynamic configuration software, and under the control of the BOSS (Byzantine Object & Subject Security) and SELF (Secure Emergent Learning of Friends) subsystems
  • the SHADOWS RUSHrouter behaves much like the FLAMERouter, but is designed for deployment to client locations, where it can serve as a host-resident proxy or default gateway, or live in the DMZ as a server, edge router, and default gateway
  • the primary role of the RUSHrouter is to enable and manage secure communications between client machines and RUSHrouters, between RUSHrouters and FLAMERouters (indirectly, because a RUSHrouter never knows when it is communicating with a FLAMERouter, which can emulate RUSHrouters), and among RUSHrouters, all under the auspices and control of the FLAMERouters
  • RUSHrouters communicate natively using Scrutiny's RUSH and UNCAP protocols
  • the RUSH (Rapid Universal Secure Handling) protocol focuses on meeting the needs of clients ( ⁇ e , on the client-side of the RUSHrouters)
  • the UNCAP (Untrusted Node Computation, Adaptation, & Persistence) protocol is a subset of RUSH and focuses on communications between the SHADOWS infrastructure and any SERVANTS that are implemented on client machines
  • RUSHrouters can also communicate (like a residential gateway/firewall) with arbitrary Internet destinations, including to and through overlay networks (e g , the anonymizing networks TOR, I2P, etc ), and can do so by using any and all available connections (like a FLAMERouter)
  • Client preferences especially firewall and bandwidth preferences
  • RUSHrouters are under the control of FLAMERouters, they technically do not actually communicate with them directly, since FLAMERouters are generally invisible except to specially privileged devices (and specifically, not to RUSHrouters) Instead, RUSHrouters communicate with a multiplicity of what they "think" is FLAMERouter, but is in actuality a SHADOWS MARSHAL (Multiprocessor Adaptive Scheduler & Task Executor/Redirector)
  • SHADOWS MARSHAL Multiprocessor Adaptive Scheduler & Task Executor/Redirector
  • a MARSHAL is much like a RUSHrouter, except that it lives not on the client side, but out in the Internet itself, typically in data centers or network hubs where multiple high-bandwidth connections are available RUSHrouters and MARSHALS work together to route, mix, aggregate, and manage traffic, under the auspices and control of the FLAMERouters
  • RUSHrouters and MARSHALS may be directed to send traffic to FLAMERouters (thinking they're sending it to another RUSHrouter or MARSHAL, because the destinations aren't recognizable as FLAMERouters)
  • Only legitimate ( ⁇ e , authorized) traffic is ever directed to the FLAMERouters, although this may include both benign and malicious traffic (if desired by the FLAMERouters, but only to the extent so desired)
  • a compromised RUSHrouter or MARSHAL that directs unwanted traffic (malicious or not) toward the FLAMERouters may face appropriate countermeasures
  • RUSHrouters are oriented to client-side functions
  • MARSHALS are oriented to "middleman" functions
  • FLAMERouters are oriented to server-side functions, yet they all can at least appear to emulate each other, to a point
  • Any FLAMERouter can emulate any number of RUSHrouters and MARSHALS, and so can communicate directly with them without revealing itself
  • MARSHALTM Multi-Agent Routing, Synchronization, Handling, & Aggregation Layer.
  • a distinguished SERVANT node having the responsibilities of fulfilling a MARSHAL role Any node, authenticated as having a MARSHAL role, that serves as a gateway for system users to access SHADOWS services via a network (e g , the Internet)
  • a MARSHAL may also communicate with other MARSHALS, under the auspices and control of a MASTER-led team, in order to implement one or more overlay networks and/or network fabrics whose purposes and characteristics are determined by the MASTER-led team (but are opaque to the MARSHALS)
  • a MARSHAL is not trusted, and the role is typically fulfilled by a SERVANT node (which is also inherently untrusted)
  • a SLAVE emulating a MARSHAL
  • SLAVE emulating a MARSHAL
  • DELEGATETM Distributed Execution via Local Emulation GATEway.
  • a distinguished SERVANT node having the responsibilities of fulfilling a DELEGATE role
  • the DELEGATE role implements secure client-side "proxy" agent that appears to locally implement a particular service which would normally be implemented elsewhere, such as on a local or remote server, but instead may actually be implemented within the SHADOWS network cloud
  • the DELEGATE proxy handles both stateless and stateful communication (the latter may be expected to be "chatty") with the client-side software requesting service, such that the DELEGATE proxy translates requests to and from the RUSH protocol as needed
  • an open-source DBMS API like that of, say, MySQL or PostgresSQL is implemented as a DELEGATE
  • the MySQL or PostgresSQL DELEGATE can then be run locally on an arbitrary machine (e g , a PC or server), and any software applications that expect the selected DBMS may run as though it were present
  • the selected DBMS may appear to be local, its operations may actually be carried out on the SHADOWS supercomputing infrastructure, there is no need for database replication, because the survival and integrity of distributed data is intrinsic to the SHADOWS architecture
  • Any number of authorized subjects at any authorized locations can similarly instantiate the selected DBMS DELEGATE, and they may all be sharing the same database (if that is what is called for), or diverse databases, as required
  • one application requires one DBMS, say MySQL, and another requires a different DBMS, say PostgresSQL, and a third application requires an OpenLDAP server, and a fourth requires an Apache web server
  • four appropriately selected DELEGATES can be instantiated on the local machine
  • Each DELEGATE may implement the requisite local API, but can communicate (via the RUSH protocol) with a local set of virtual RUSHrouters, which can communicate (again, via the RUSH protocol) with the distributed SHADOWS infrastructure, where the actual computing operations can be carried out in accordance with an appropriate SLA
  • the DELEGATE concept can be applied to common Internet-based services, including DNS, email (POP3, SMTP, etc ), VoIP (SIP), and so forth
  • the DELEGATE concept is applied to HPC-class interprocessor communications by implementing an MPI API
  • a SHADOWS "machine” comprises one or more nodes sharing a common chassis or other container of some sort, without regard to specific packaging
  • SCRAM' is one such machine, its extruded aluminum chassis is cylindrical in shape, comprising a set of Quadrants, each of which comprises a set of Lobes and an optional set of Blades x
  • the mam section ( ⁇ e , the vertical upright portion) is a single large aluminum extrusion with an overall diameter of about 25" (including cooling fins not shown)
  • the main section is split into three identical interlocking sections (one per 90° quadrant), each of which has a maximum diameter of ⁇ 20"
  • the main section is a single aluminum extrusion with an overall diameter of about 12" to 13" (including cooling fins not shown), with the other dimensions and capacities scaled as needed (while maintaining similar aspect ratios) Faceplates and/or IO panels that attach via the quadrant interlocking mechanism are used to cover surfaces exposed by the "missing" fourth quadrant
  • Inner Diameter The “inner” diameter is smaller than depicted in order to increase the interior room, and is assigned a cooling function
  • Lower extruded section is three interlocking "outrigger" sections (one per quadrant) that are identical large extrusions with a maximum “diameter” (cross-sectional length) of about 28" in a preferred embodiment (or somewhat less than 17" in an alternative small-form-factor embodiment not shown), or six interlocking "outrigger” sections (two per quadrant) that are identical large extrusions with a maximum "diameter” (cross-sectional length) of about 20" in a preferred embodiment (or about 12" in an alternative small-form-factor embodiment not shown, with values proportional to dimensions shown)
  • the coolant sump and pumps are accessed from the "open" side (where the missing quadrant is)
  • the unit is serviced from the top Note that the sump is normally dry, except in the rare case of accidental spills (all the working fluid couplings are blind-mating and self-sealing)
  • the pumps are small and nominally dissipate less than 25 watts each, while pumping up to 1200 LPH ( ⁇ 317 GPH, or ⁇ 5 3 GPM) or providing pressures up to 3 5 bar (50 PSI)
  • the SCRAM Supercomputer is designed to be self-contained storage-wise, with up to 32 full-size (3.5-inch) disk drives per quadrant, or (preferably) 128 small-form factor (2.5-inch) disk drives per quadrant
  • An Introductory Limited Edition might ship with 2 nearline outrigger blades, each populated with 16 drives of 80 GB, or 1 28 TB per blade, for a total of 2 56 TB
  • the selected 80GB drives are at a sweet spot for price and performance
  • the lower capacity drives because many more drives can be purchased for the same amount of money, and more spindles means higher levels of parallel access
  • the highest density 2.5-inch SAS disk drive has a raw (uncompressed) capacity of 146 GB, so the maximum hard disk storage capacity possible with 128 drives is 18.7 TB per quadrant (146GB x128), or 56 TB for the chassis
  • dual-ported SAS drives there are 256 channels of access (300 MBps each), rather than 128 channels (all SATA drives are only single-ported)
  • each NEARblade is a 16-dr ⁇ ve SAS/SATA hybrid, consisting of 4 to 8 dual-ported SAS drives for speed and 8 to 12 single-ported SATA drives for high capacity and cost reduction Note that despite the fact that typical SAS drives (10K RPM) are much faster than high-capacity SATA drives (5400 or 7200 RPM), both are considered "only" nearline storage in a SCRAM Supercomputer
  • a 4-SAS, 12-SATA hybrid with the drives noted above would have 572 GB of high-performance drives (via 8 channels) and 3.6 TB of high-capacity drives (via 12 channels), for a maximum total capacity of just over 4 TB per NEARblade (via 20 channels) A full complement of 24 such blades would yield 96 TB of hybrid storage (13 7 TB SAS, 86 4 TB SATA) with 2007 technology
  • Each storage blade is likely to weigh 15 to 20 pounds (16 drives plus frame, thermal conductors and coolant) If all the drives in a bay were spinning at once, they would require 160 to 300 watts of power, depending on the mix of SATA and SAS drives
  • blades are either top-loaded or front-loaded, but must be selected for maintenance and powered down before removal This is a matter of authenticating, making a menu selection, and waiting for a light to indicate that the blade is ready to be removed, and that the solenoid- controlled blade-latching mechanism is unlocked
  • An outrigger blade such as a drive bay can be removed without shutting down the SCRAM lobes in the corresponding quadrant
  • the outer fins Due to the very large surface area, the outer fins provide substantial cooling even in the absence of data center-style air conditioning Phase-change working fluid is circulated in the outer walls, causing the vapor to condense under normal circumstances
  • the walls containing the optional inner fins also incorporate fluid circulation channels, and can provide cooling when forced air is available (say, from a data center underfloor air conditioning system)
  • a high- reliabihty, low-noise blower is also contained in the base (as a backup) to supplement other means of cooling during over-temperature conditions
  • the fluid channels in the inner walls are distinct from the fluid channels in the outer walls, and may be used separately, although there is a relatively low-resistance conduction path in the current design because they're contained in the same all-aluminum extrusion
  • a SCRAM node is composed of 1 to 4 quadrants O
  • Each quadrant contains 4 lobes ⁇ that are fully connected to each other and to the lobes in the other quadrants
  • Each quadrant controls up to 8 optional "outrigger blades" ⁇ (discussed elsewhere), in any combination, and each blade is fully connected to each lobe ⁇ in the corresponding quadrant O
  • one or more of the blocks depicted above as (optional) "outrigger blades” also are implemented internally ( ⁇ e , within a lobe) in a non-bladed manner, so that the specific means are also built into the lobe and provide the corresponding capability inherently ( ⁇ e , without the need for optional outrigger blades), in order to reduce the cost of a basic configuration
  • Each lobe's workload is handled by SELF/CHARM blocks that function symbiotically to securely store, retrieve, and process information using an associative memory hierarchy
  • the SELF roles of BOSS, MASTER, and SLAVE are each paired with a CHARM PUMP capability that is tailored for the particular role
  • the pairings (BOSS & PUMPO, MASTER & PUMP ⁇ , and multiple SLAVEs with multiple PUMPs ⁇ ) are depicted without arrows to emphasize the symbiotic coupling
  • Each pairing includes one or more means for processing, along with one or more levels of local memory and/or cache Note that, in a preferred embodiment, the multiple SLAVES with multiple PUMPs ⁇ in a one-to-one configuration are replaced with one or more PUMPs ⁇ , each having a multiplicity of SLAVEs ⁇
  • the SELF means and the CHARM means may each be implemented via one or more traditional CPUs (SMP or not), programmable and/or reconfigurable logic (e g , FPGAs, ASICs, etc ), or even discrete logic, or any combination thereof, including implementation of a pairing or multiple pairings on a single chip using any combination of means
  • the BOSS/PUMPO and MASTER/PUMP ⁇ pairings are implemented via a single CPU handling the BOSS & MASTER functionality, and a single FPGA or Structured ASIC handling both their respective PUMP functionalities
  • the SLAVE/PUMP ⁇ pairings are each implemented via a single CPU handling the SLAVE functionality and a single FPGA or Structured ASIC handling the corresponding PUMP functionality
  • each lobe has a PEERS O switching & routing fabric, but in a preferred embodiment there are actually at least two redundant fabrics working together in an active/active configuration
  • SCRAM machines provide a solid foundation for the SHADOWS infrastructure, which is highly distributed, with inter-node communications occurring globally over WANs, quasi-locally within a locale via WLANs, and locally (within a site) via a multiplicity of LAN switch fabrics and/or meshes Nonetheless, the SHADOWS infrastructure is designed to "play nice," which allows it to safely participate in other networks, in various roles (e g , supercomputer, NAS appliance, a complete SAN deployment, etc ) - all as a first-class citizen Furthermore, the SHADOWS infrastructure is designed to take advantage of idle or unused computing, storage, and communications resources associated with the networks to which it is attached, as authorized, in order to maximize its supercomputing throughput while minimizing the cost of doing so The SCRAM machines provide the magic that makes it possible
  • a SCRAM machine comprises four major logical functions, and thus four major types of means SELF, CHARM, CORE, and FRAME
  • the SELF means defines roles for key architectural entities and enables secure, trustworthy, high- performance cooperation among those entities in the SHADOWS infrastructure
  • the CHARM means comprises a local hardware implementation of a secure, distributed ( ⁇ e , local node plus multiple remote nodes) hierarchical and associative memory processing system, with overlaid relational capabilities and a compressed persistent store
  • the CORE means comprises the processes and protocols related to the implementation of an associative memory, reasoning and belief systems, and cooperative processing and communications protocols
  • the FRAME means comprises the hardware and processes for survivably and securely energizing and maintaining the system
  • each of a SCRAM machine's Lobes comprises at least one MASTER and typically at least one SLAVE
  • both MASTERS and SLAVES typically comprise multi-core general purpose processors, but may optionally comprise special-purpose processors, including without limitation, devices or modules comprising fixed or reconfigurable logic such as ASICs, FPGAs, and so forth
  • Each MASTER is further distinguished by its isomorphic association with unique instantiations of BOSS and SELF (which are implemented at least partly in secure, immutable hardware)
  • a "node” could refer to the SCRAM machine itself, or any of the Quadrants, Lobes, Blades, MASTERs or SLAVEs, or even the processors, whereas they collectively determine the Machine 6.1 SCRAM Subsystem
  • SELF is an automated role-oriented "immune system” that differentiates “self and “non-self, “friend” and “foe” - thus, said system may distinguish between authorized and unauthorized objects, subjects, and interactions
  • SELF may establish and maintain trust among interdependent systems, subsystems, and components
  • SELF may integrate with BOSS (see section 7 1 3) to incorporate Byzantine agreement logic (from the classic "Byzantine generals" problem) into its decision-making process, so that it may make correct decisions in the face of overt or covert attack, collusion, and corruption
  • SELF may be highly integrated with BOSS, and with the RECAP, UNCAP, and/or RUSH protocols
  • any anomalous behavior detected by SELF, or of which SELF becomes aware may trigger an appropriate "immune system" response
  • the idea is to use distributed "teams” of nodes in a self-healing network as the basis for managing and coordinating both the work to be accomplished and the resources available to do the work
  • the SHADOWS concept of "teams” is responsible for its ability to "self-heal” and “adapt” its distributed resources in an “organic” manner
  • the "teams” themselves are at the heart of decision-making, processing, and storage in the SHADOWS infrastructure Anything that may be important may be handled under the auspices and stewardship of a team
  • the purpose of having teams is at least five-fold 1) to distribute the automated resource management overhead, 2) to partition, parallelize, and distribute the actual processing load and improve overall performance, 3) to increase the fault-tolerance of the system, 4) to increase the inherent survivability of the system, and 5) to increase the difficulty of successfully attacking the system
  • SHADOWS Although distributed, the SHADOWS infrastructure cannot be correctly described as strictly centralized or strictly decentralized It is definite not centralized in the sense that a traditional mainframe or supercomputer is intentionally centralized Neither is it decentralized, in the sense that a peer-to-peer network, or perhaps a grid network - is intentionally decentralized (so as to avoid centralized functionality, which often requires significant trade-offs) Rather, SHADOWS is a little of both, in a "Borg"-l ⁇ ke way SHADOWS might best be described as having a conceptually centralized function that happens to have local representation, but a highly decentralized implementation
  • every MASTER is the leader of at least one team to which other MASTERS, both local and remote, are also assigned Depending upon the nature of a particular team, there may also be non-MASTER participants, and these may be voluntary (SERVANTS) or non-voluntary (SLAVEs)
  • a potential MASTER can be a SERVANT ( ⁇ e , it can "volunteer"), but cannot lead a team
  • a SERVANT is a useful, but untrusted, "working storage” resource - it is capable of storing, retrieving, and forwarding encoded, encrypted information (but not decrypting or decoding it)
  • a SERVANT doesn't possess enough information to make decrypting and decoding possible, regardless of the computing resources available to a would-be attacker
  • a SERVANT is also capable of executing ⁇ n-memory processes against information securely received but not stored, under the auspices of a
  • Some SERVANTS are assigned a MARSHAL role, which adds to their responsibilities, but not to their trustworthiness (like the SERVANT, the MARSHAL role is inherently untrusted)
  • Every SHADOWS resource, task, and identifiable entity of any virtually kind is assigned to a team, as are all users
  • the SHADOWS "Boig-Iike" operational team concept may be vaguely iemintscent of the physics concept of "quantum entanglement” — a quantum mechanical phenomenon in which the quantum states of two or more objects have to be desci ibed vv ith ieference to each other, even though the individual objects may be spatially separated
  • a SHADOWS team comprises, for example, at least five active MASTERS two colocated MASTERS, LocalMASTER_1 (the team leader) and Loca!MASTER_2, and three non-colocated MASTERS RemoteMASTERjl , RemoteMASTER_2, and RemoteMASTER_3
  • This minimal team is sufficient to maintain Byzantine agreement in the face of one Byzantine fault (e g , a single corrupt MASTER) or one failed site " (e g , due to a regional disaster)
  • SHADOWS may tolerate a combination of up to c known crashed or unresponsive MASTERS and up to f faulty or malicious (but unknown) MASTERS, where (c+2f) ⁇ (n-k) The mechanisms for accomplishing this are further explained in section 7 1
  • a SHADOWS team can form and begin "rounding itself out," by virtue of extending its membership as the SHADOWS infrastructure grows
  • potential MASTERS that cannot yet participate in a MASTER role may volunteer as SERVANTS, and thus become immediately usable by any and all existing SHADOWS teams
  • potential MASTERS may qualify to become MASTERS, in which case they can be assigned to one or more SHADOWS teams in subordinate (non-team-leader roles), and can also be assigned teams of their own as new teams are formed Note that non-team-leader roles can become "acting" team leaders at any time, if their superiors are unable to perform their roles
  • indmdual subsystems may still function sufficiently as to be able to call home and contact remaining portions of the SHADOWS infrastructure In such a circumstance the surviving resouices w ill be assimilated back into the infrastructure as SERVANTS if they cannot qualify or ie qualify as MASTFRs
  • SHADOWS architecture acknowledges this as a starting point, although there is ieason to believe that 3f+l may be ovei ly conservative However, because survivability and tiust are key to SHADOWS, conservatism is quite acceptable In any case, if ?/+/ is too conservative, then achieving 3f+I means that a largei numbei of faults may be tolerated w Hh no actual changes
  • SHADOWS uses a linear MDS code (e g , a v ai iant of Reed-Solomon) to achieve Bv/antine agreement 7.1.1.6 Teams are Stewards of Information to be Stored or Handled
  • any team receives an artifact containing information to be stored or handled, say from another team, or an external source, it is analyzed at least sufficiently to classify the information boundaries if not already known (for example, it is helpful to know the granularity of the object, such as whether it is a file, database record, or email message, etc )
  • the artifact's ContentDigest is computed at the coarsest granularity, and then looked up in a local "RecognizedContentlndex” to find out if the artifact or its content has been previously handled
  • the actual lookup occurs by first checking the SubmittmgTeam's "local copy " " of the RecognizedContentlndex, and if not found, sending a lookup request message containing the ContentDigest to the accountable ContentStewardTeam If found either way, then the ContentAlias is now known, and, from a simplistic viewpoint, the storage request has essentially been "fulfilled
  • the information to be stored is "new" by definition
  • the ContentStewardTeam may eventually be responsible for assigning a ContentAlias to the information to be stored, pairing it with the associated ContentDigest, and "publishing" it to the SHADOWS infrastructure
  • assignment cannot occur until the artifact and its information content is received and vetted according to the ContentStewardTeam's rules, because the assignment of a ContentAlias is both automatic and permanent, and thus, by design, cannot be changed later
  • an artifact such as a book, which contains unstructured information from the viewpoint of a DBMS (database management system), for example, but yet clearly has some sort of structure based on its inherent natural boundaries and granularities (e g , entire book, chapters, pages, paragraphs, sentences, etc )
  • the entire book has a single identity
  • Each of the chapters also has an identity, as do each of the pages, each of the paragraphs, each of the sentences, and so on
  • the difference in identities between the content of two editions of a particular book may boil down to the specific areas where they differ in content, and this may occur anywhere along the granularity spectrum This is likewise true for artifacts that are purported to be different - their actual differences can be discovered and revealed
  • Every SHADOWS process is an artifact, and thus has an identity, and thus is assigned to one or more teams, each of which has a particular role with respect to that process
  • SHADOWS teams cooperatively share the management responsibilities of each artifact, and process artifacts are no exception
  • One SHADOWS team is responsible for storing a particular process ( ⁇ e , its executable image is an artifact), another for verifying it prior to distribution or execution, another for executing it, another for
  • the Submitting Team is actually part of a computing clustei ot some sort, so the local copy of the RecognizedContentlndex is most likely distributed over the local cluster, meaning that even a local lookup entails sending a message to the appiopnate local team responsible for that particulai slice of the RecognizedContentlndex monitoring its execution, etc
  • "software rejuvenation" is called for, multiple SHADOWS teams are involved on a cooperative basis
  • the SHADOWS FACTUAL capability is conceptually "just" a memoization system, but one that is designed to operate at global scale and supercomputing speed, with the high levels of security and survivability commensurate with the SHADOWS infrastructure Teams are used to perform the processing required to arrive at previously unknown results, and to reach consensus on "vetted" results prior to memoization (which is particularly important for FACTUAL, because memoized results can be reused as authoritative results that sidestep process execution)
  • the various content and identities associated with memoized results need to be stored, which involves teams on a SHADOWS-wide basis, as does the lookup of memoized results
  • the problem to be solved is queued for processing, but can normally be dequeued if a vetted, memoized result is obtained prior to the start of execution
  • a memoized result that is obtained after execution has already started can be used as a test oracle to verify the result, thereby serving as a built-
  • Rejuvenation can include a new version of executable from the same source, with no functional changes (using 1-way translation to deter reverse-engineering)
  • Each node maintains a bitmap of globally active process-port combinations, including ports in transition (e g , due to version update) Assuming one bit per process-port, this requires at most 64 Kbits or 8 KB
  • An encrypted, authenticated bitmap is distributed periodically and upon request via RECAP Active process-port updates are also distributed periodically via RECAP, as are periodic authentication requests to verify a non- corrupted image at each node Incorrect or missing responses trigger SELF reporting and likely escalation
  • Any message received on a globally non-active port constitutes behavior that is both a diagnostic clue and/or a SELF clue, as is any message received on an active but not-ready port at a particular node
  • the latter could be legitimate within a short window corresponding to propagation delay, if the sender did not receive a not-ready update in time to prevent message transmission In the latter case, the sender must immediately follow up with a retraction message within a specific time period if the difference in message timestamps (request time minus not-ready time) exceeds the allowable maximum (which is designed to accommodate propagation and update delay)
  • the timely receipt of an authenticated retraction message (say, within a second, or some other policy-specified threshold) prevents escalation
  • Each node maintains a map (e g , a bitmap) of site-local volunteer nodes - nodes whose load is sufficiently light (both absolutely and relatively) that they can accommodate a higher-than-average load (which means that "ready" virtual SERVANT processes and/or SERVANT VMs, possibly running on some combination of MASTERs and SLAVEs, can be granted execution resources)
  • a map e g , a bitmap
  • this bitmap of site-local volunteer nodes can be represented in only 1 KB
  • Each node on a multi-node "street" updates others on its street via street-local multicast, and they take turns updating their neighborhood
  • Each node in a neighborhood takes its turn updating its community, and each node in a community takes its turn updating the other communities in the site
  • the site-local volunteer bitmap can be ANDed with the site-local process-port bitmap for a particular process-port combo (which is updated in the same manner) in order to find volunteer nodes for the process- port Volunteers are typically sought at a higher rate than draftees (which can be any node with a ready process-port)
  • bit-sliced index manipulation Typically, a random or pseudo-random number is generated to find a starting bitmap offset, and the next available bit is selected (or next N bits if more are needed) A full word size of bits can be read at once Compressed bitmaps are also possible (see bit-sliced index manipulation)
  • BOSS is a distributed, timely, trusted computing base (TCB) and object/subject security system that incorporates Byzantine agreement logic (from the classic "Byzantine generals" problem) in its decision- making process, and collectively makes security decisions in a "fail-silent” manner that provides survivability even in the face of multiple failures and/or corrupted nodes
  • TBC trusted computing base
  • BOSS is implemented and instantiated only in conjunction with a MASTER, and works in conjunction with CHARM to control who gets access to what, and when, while ensuring that unauthorized information is not exposed (not even to other internal systems)
  • BOSS implements the TCB, and thus owns the security of the system
  • the rest of the components rely on BOSS to make correct security decisions - and it must make 100% of the security decisions
  • Any BOSS node that fails or becomes corrupted can be restarted or replaced, and in any case cannot be trusted until its trustworthiness can be re-established from scratch to the satisfaction of the surviving trusted nodes, including, at a minimum, other MASTERS with which it previously participated as a team member Keeping in mind that every MASTER is associated with a BOSS component (and that BOSS is a distributed function), refer back to "A SHADOWS Team Comprises Members with No Common Regional Threats" on page 34 for more information on Byzantine agreement
  • BOSS is designed to enable the SHADOWS infrastructure to support both classified and unclassified information processing and storage (e g , to meet or exceed Common Criteria (CC) Protection Profiles (PP) such as the U S DoD Remote Access Protection Profile for High Assurance Environments Version 1 0, June 2000, nominally at EAL5, or potentially at EAL6 if implemented by a single, qualified development organization)
  • CC Common Criteria
  • PP Protection Profiles
  • the DoD defines the TCB as the totality of protection mechanisms within a computer system - including hardware, firmware, and software - the combination of which is responsible for enforcing a security policy
  • a TCB consists of one or more components that together enforce a unified security policy over a product or system
  • the ability of a trusted computing base to correctly enforce a security policy depends solely on the mechanisms within the TCB and on the correct input by system administrative personnel of parameters (e g , a user's clearance) related to the security policy (DoD 5200 28-STD)
  • TCSEC1983 defines the TCB as "the totality of protection mechanisms within a computer system, including hardware, firmware, and software, the combination of which is responsible for enforcing a security policy
  • Any "Candidate MASTER” that is unable to establish itself as a MASTER may retains its candidacy but is unable to fulfill any of the responsibilities of a MASTER Rather than waste the resources of such a candidate, it may "volunteer” (or attempt to volunteer) to operate under the auspices of a team of MASTERS, in the role of SERVANT (Service Executor, Repository, & Volunteer Agent - Non-Trusted)
  • Each MASTER is distinguished from other MASTERS and from non-MASTERs by a set of inherent traits and capabilities possessed only by MASTERS and "Candidate MASTERS" (which are singleton, would-be MASTERS that have not been accepted and deemed trustworthy by a sufficient quorum of other MASTERS and/or Candidate MASTERS, and thus have not yet attained "MASTER-hood")
  • each MASTER has a one-to-one correspondence with, and physical attachment to 1) a BOSS device or subsystem that has a universally unique cryptographic identity, and 2) a SELF device or subsystem that can cryptographically establish whether the BOSS device or subsystem and any other arbitrary entity claiming to be part of the same system are indeed parts of the same "self "
  • "self in this context refers to a bona fide SHADOWS infrastructure This test is somewhat like a DNA-based identification test where parts of the same 'self share a common DNA sequence, so that in concept, your nose and your right hand could both "claim" to be part of the same self (e g , "you") - and the claim could be definitively verified
  • a MASTER or "Candidate MASTER” is also behaviorally related to other MASTERS and “Candidate MASTERS” that are part of the same SHADOWS infrastructure, and these behaviors are intended to be collectively inimitable
  • SHADOWS infrastructure an ability to imitate behavior intended to be inimitable is merely inconclusive - only the converse is true "If it does not walk exactly like a duck, OR it does not talk exactly like a duck, then it is not a duck ⁇ ”
  • any non-self behavior by a MASTER or “Candidate MASTER” is taken as evidence of counterfeit *
  • a shunned MASTER can be rejuvenatid at the first sign of misbehavior
  • misbehavioi ol for example, a communication channel used for MASTER-to- MAST ER communication
  • DoS denial-of-service
  • Each MASTER is the primary steward of several sets of resources, and for each such set, leads a team of MASTERS that is collectively responsible for that set of resources, despite the simultaneous failure or corruption of any number of MASTERS (up to a policy-specified threshold) Failed and/or corrupted MASTERS (including the team leader) are adaptively tolerated until detected, at which point they are replaced
  • a system's resources essentially refer to its capacity as a network of "working storage” comprising the areas of communications, processing, storage, and energy
  • the communications resource area comprises connectivity and bandwidth, as well as quantitative quality levels for each (connectivity comprises availability and reliability, for example, and bandwidth comprises rate, latency, and jitter, among others)
  • the processing resource area comprises the ability and readiness to accomplish particular tasks (with accompanying arrival rates, service rates, etc , as well as quantitative quality levels)
  • the storage resource area comprises the ability and capacity to store information to transient and/or persistent memory and subsequently retrieve it (further comprising various addressing means and rates, with accompanying quantitative quality levels)
  • the energy resource area comprises the various energy sources and sinks (for example, having sufficient energy to power a combination of system components during a particular time window, and to absorb, store, or reject any waste energy produced during that same window), with accompanying quantitative quality levels
  • Each MASTER maintains a viewpoint of the resources claimed to be available in the system, both locally and elsewhere, including its own, in a radial fashion
  • each resource claim is associated with a reputation that can be used to weight that resource claim Relative proximity to the center (as represented by distance from a set of local MASTERS) determines relative update detail and frequency For example, local resources (those comprising the center) are the most detailed and frequently updated, whereas nearby (but non-local) resources are less detailed and less frequently updated, and remote resources are the least detailed and least frequently updated
  • each MASTER summarizes the resources for which it is responsible, normalizes the summary to a format that is standardized among the local MASTERS, and shares it with its immediate peers ( ⁇ e , the other local MASTERS) on a mutually agreeable schedule
  • the schedule of local updates is both event-driven and periodic, but the period is actually time-varying on a prearranged basis, as agreed among the local MASTERS (failure to meet the time-varying requirements provides a hint to SELF that may trigger an "auto-immune" response)
  • each MASTER also uses the local resource summaries provided to it by its immediate peers (the same set of peers referred to in the previous paragraph) and creates a further summary comprising their collective local resources, then normalizes the collective summary to a format that is standardized among those peers, and shares the collective summary with non-local-but-nearby MASTERS on a mutually agreeable schedule (Conceptually, in a set of concentric rings centered on the local MASTERs, these non-local-but-nearby MASTERS would correspond to the nearest larger ring)
  • the schedule of next-ring updates is both event-driven and periodic, but the period is actually time-varying on a prearranged basis, as agreed among the local MASTERS (failure to meet the time- varying requirements provides a hint to SELF that may trigger an "auto-immune" response)
  • BOSS Byzantine agreement via BOSS is used locally by the BELIEF (Bayesian Emergent Learning & Intelligent Evaluation of Facts) subsystem to create a 4-b ⁇ t reputation estimate for each process at each node, including its own, based on its belief in reputation estimates proffered by others, which
  • the notion of radial proximity can be substituted ⁇ ith a hierarchical notion based on fixed granularity - e g , neighboihood/town/state/country are weighted by their own reputations and normalized to a 4-b ⁇ t result
  • the local BOSS subsystem maintains its own view of the 4-b ⁇ t reputation estimate for each process at each node, including its own, as a rolling average that can be queried at a rate independent of its update rate
  • the w-bit ClaimWeightedByReputation value Given a c-bit ClaimedValue and an r-bit ReputationForClaimedValue proportional to the confidence in the claimant with respect to such claims (or perhaps overall), where 0 is worst-case, and (2 C )-1 and (2 r )-1 are the respective best-case values for each variable, the w-bit ClaimWeightedByReputation value can be calculated as
  • ClaimWeightedByReputation (ClaimedValue * ReputationForClaimedValue) /(2 lc+r w ) where all variables are integer, w ⁇ (c+r), and (c+r-w) is usually a constant The division by a power of 2 can be accomplished with a simple right-shift of its exponent, yielding
  • ClaimWeightedByReputation (ClaimedValue * ReputationForClaimedValue) » (c+r-w)
  • ClaimWeightedByReputation (ClaimedValue * ReputationForClaimedValue) » (4 + 4 - 4)
  • the idea is to complement - but avoid the necessity of - conventional synchronous (lockstep) execution of identical instruction on identical CPUs at exactly the same time, in a high-availability (HA), duplicate modular redundancy (DMR) or triple modular redundancy (TMR) configuration with voting logic to determine the correct outcome
  • HA high-availability
  • DMR duplicate modular redundancy
  • TMR triple modular redundancy
  • the lockstep approach is useful for quickly detecting and handling transient errors and hardware problems, including software errors associated with their proper handling (by masking such errors when possible)
  • This approach can greatly improve the availability of a particular machine in a friendly environment, but it does nothing for the availability in a hostile one Thus, if the HA server or site is compromised or taken down, the system immediately becomes either untrustworthy or unavailable
  • SHADOWS assumes that there are no completely trustworthy individual nodes, but that consensus among a policy-determined quorum is sufficient to warrant trust In particular, consensus is reached on the final result rather than on each instruction involved in its calculation
  • This approach has the significant advantage of being able to incorporate arbitrary levels of diversity in multiple dimensions, such as geographic locations, political environments, security mechanisms, algorithms, software versions (e g , differentiating among authors, skill levels, code versions, programming languages, build environments, certification levels, etc ), CPUs, memory systems, EMP/radiation hardening, physical access controls, etc )
  • voters may actually be data consumers not involved in the calculation (that is, they're not the "Byzantine generals") In this scenario, the voters do not have the actual results, and must first receive results from the producers ( ⁇ e , from the Byzantine generals) First, the producers can each calculate the appropriate result (including compression, encryption, and a CRC or message digest as appropriate) However, rather than each producer transmitting the entire result to each consumer, the producer then computes an FEC-encoding of the result message (with a suitable rate code) and extracts its "share” of the FEC-encoded message, which it also encrypts, and to which it may add a MAC (message authentication code) and/or digital signature The "share” is then transmitted to any consumers that need it, along with identifying information as appropriate for the communications protocols in use The code rate (n,k) used determines the number of uncorrupted "shares" ( ⁇ e , k of n) that must be received in order
  • Data consumers may be addressed individually via unicast, or collectively via multicast, but in both cases the ability of a group of authorized (but not necessarily trusted) producers to send FEC-encoded slices (or "slivers") to the consumers greatly increases the likelihood that each consumer receives the correct desired data at the maximum rate it can be received (such as when limited by the consumer's aggregated inbound bandwidth, which may be greater than the individual outbound rates of the individual senders)
  • the BOSS subsystem is built on the principles of a Trusted Computing Base (TCB) as are known by those skilled in the art In addition, however, BOSS is further constructed with accurate time-keeping mechanisms and hardware support for the logic and processing required to implemented a Timely TCB, (TTCB), whose principles are also known in the art, but to a lesser degree, and are well described elsewhere
  • TTCB Timely TCB
  • the BOSS subsystem may also be implemented (or emulated) in software, given an environment is sufficient to meet the particular set of needs
  • the novel BOSS hardware is tamper-proof or tamper-resistant and provides enables synchronization and reconciliation of multiple time-keeping sources that are authoritative to varying degrees (e g , local atomic clocks, local crystal-controlled oscillators, terrestrial radio or satellite-based signals such as WVW 1 GPS, etc )
  • the BOSS hardware also securely implements various cryptographic processes and provides secure encrypted storage of associated variables, keys, and so forth
  • the BOSS hardware also securely implements the error and/or erasure coding mechanisms described below, that enable the use and application of forward error correction (FEC) as described below
  • FEC forward error correction
  • SHADOWS can tolerate a combination of up to c known crashed or unresponsive MASTERS and up to f faulty or malicious (but unknown) MASTERS, where (c+2f) ⁇ (n-k)
  • Each of the n peers uses the same information basis to independently perform the computation (which should be identical to those created by the other n-1 participating peers)
  • k which may vary with context determines how many correct slices - and thus, how many correct peers - are required to reconstruct the consensus result
  • This technique contributes to Byzantine fault-tolerance, since up to (n-k) faulty contributors can be ignored (however, the SELF and BOSS subsystems take note of such failures)
  • each peer can simply share a single slice with the others, and the specific slice to be shared is tied to the relative number of each peer within the set of n collaborating peers (e g , peer 1 shares slice 1 , peer 2 shares slice 2, and so on, up to peer n, which shares slice n)
  • Each peer digitally signs its slice so that recipients can verify whose it is ( ⁇ e , that it was actually provided by the peer with which it is identified) Any peer that shares the "wrong" slice, or a "corrupted” slice, or fails to share a slice, becomes included in the set of up to (n-k) faulty contributors whose slice can be ignored during this round of computation (but the responsible peer is noted, of course)
  • the consensus result can be independently reconstructed locally without further communication
  • each sending peer's values for n' and k' are independent of those used by the other peers, and may be heavily influenced by the properties of this communications channels, since extra redundancy is may be appropriate
  • Each peer represents a single potential ⁇ >zantine fault oi failure, and thus gets only one ⁇ ote in the original consensus iesult
  • one peer may independently decide to send 20 slices to a third party, such that any 15 slices is sufficient to reconstruct the sender's original slice of the consensus result
  • the sender may opt to split the slices up among the available channels such that each channel handles a few of the slices, according to its individual data rate, congestion, reliability, etc.
  • the authorized third-party entities need only to receive any legitimate k' of n' slices from a given sender to reconstruct that sender's single slice of the consensus result
  • the authorized third-party entities need only to reconstruct legitimate slices from any k of n senders to reconstruct the original consensus result
  • SHADOWS native processes do not push data around as loads are shifted and requests are made, etc Instead, IDs are pushed around, and if a process actually needs the associated data, it can request it (on a "pull" basis), or, if there are no other operands, just forward the request to the team that owns the data (resources permitting)
  • the act of pushing an ID has the effect of putting the team owning the associated data on notice that it may be needed soon, essentially identifying the ID as a speculative prefetch opportunity
  • a SHADOWS native process has an input queue and an output queue, as depicted above, on the left Ignoring security issues, the input queue accepts tuples of the form ⁇ TxID, Operand ID List ⁇ , performs the work of the process which is to generate one or more Result IDs, then enqueue them for distribution
  • the transaction id (TxID) ties the Operand IDs (received as input) to the Result IDs associated with the processing results
  • the input queue [1] accepts tuples of the form ⁇ TxID, Operand ID List ⁇
  • An interior process immediately fires off requests [2] for the actual data associated with the Operand IDs, requesting the teams that have the specific data to send it along to a specific destination team (which may be the current team or some other one)
  • a message may also sent to the specified destination team to put it on notice that data for the particular TxID may soon be arriving (unexpected data may trigger defensive behavior)
  • the data When the data is retrieved by the team that owns it, it may be sent to the specified destination team [3], thus, for a given TxID, the outbound message [2] and the inbound message [3] are unlikely to occur on the same machine, unless the sender of [2] specifically wishes to receive and process the data [3] within its own team for some reason
  • the specified destination team collects the operands [3] and stores them in RAM [4]
  • it may then enqueue process descriptors [5] (including pointers to the operand data in RAM) into the input queue of the embedded "simple process " The simple process may later dequeue the process descriptor [6] and associated data, perform its process and enqueue the "raw" results [7] to its output queue
  • a postprocessor may dequeue the raw results [8] in order to create one or more digests of the results, thereby generating one or more Result IDs
  • the Result IDs and raw data may be "pushed" to
  • Each MASTER may have its own viewpoint of the entire system based on its own local statistics and a global (i e , non-local) statistical summary of the rest of the system, so that it may "think globally, but act locally " Statistics may be summarized into simple percentiles (a preferred embodiment may use quartiles, such that any statistic may be summarized in just four states - requiring just two bits - for the purpose of decision- making)
  • Each MASTER may communicate its local statistics to its immediate neighbors, both periodically and whenever a state-change occurs
  • the neighboring MASTERS may individually calculate the statistics for their "neighborhood” and communicate them upward to a higher aggregation level
  • the neighboring MASTERS may "take turns” rolling up the statistics for their "neighborhood” and communicating them upward to a higher aggregation level
  • a systems thinking diagram as depicted in Fig. 7.3.2-1 may help teach how each MASTER'S work- scheduling decision-making may be influenced by the "forces" associated with the current values of the system variables
  • the arrows marked with “S” represent forces or trends that may cause one system variable to affect another in the same direction ( ⁇ e , the pointed-to variable may be influenced to go up or down as the other one does)
  • the arrows marked with "O” represent forces or trends that may cause one system variable to affect another in the opposite direction ( ⁇ e , the pointed-to variable may tend to go down if the other one goes up, or up if the other one goes down)
  • MASTERS may also delegate work to SLAVEs and/or SERVANTS, but only those for which they are responsible
  • the two primary variables driving resource-balancing are the AverageLocalNodeUtilization O and the AverageNonLocalUtilization ⁇ Examples: A utilization of 100% means that the node is operating exactly at the desired utilization goal (which may be somewhat less than its raw capacity), whereas 70% means that its capacity is underutilized by 30% Likewise, 150% means that 50% more work is queued at the node than it was intended to handle all at once It could still be accomplished eventually, but service level agreements (SLAs) might not be met, and customer satisfaction might suffer An average utilization that continually exceeds 100% is one indication that system capacity should be increased
  • a LocalNodeRelativeWorkload ⁇ value of +20% means that the local workload is 20% above the average workload of the other nodes ( ⁇ e , excluding itself), whereas -20% means that the local workload is 20% below the average of the other nodes
  • This percentage can then be compared to the current percentile (or quartile, etc ) thresholds, in order to classify the workload of the local node relative to the other nodes
  • One goal is to determine which "load category" the local node fits best , such that one of four values (which requires 2 bits to represent) can be assigned for classification purposes, e g Very Heavily loaded (3), Heavily Loaded (2), Lightly Loaded (1 ), Very Lightly Loaded (0) While any eligible node can be delegated to, those with the lightest load may receive a statistically larger fraction of any delegated work, and those with the heaviest load may receive the smallest fraction (and possibly zero) Workload adaptation is continuous, by enjoys relatively low overhead due to the hysteres
  • VM virtual machine
  • Examples of systems that could be tailored to the SLAVE PUMP in a straightforward manner may include the open source software “Xen” (now a part of Linux), “OpenVZ,” and others, as well as the experimental software “Denah” (and its variants, from the University of Washington), both of which are well-suited to Unix, Linux, and various BSD environments
  • Each PUMP implements two inter-PUMP HT links Ideally, these are 32-b ⁇ t HT links, but could be 16-b ⁇ t if FPGA I/O pins are insufficient
  • PUMP 0 is distinguished in that it terminates both ends of the daisy chain, while the other PUMPs each implement an HT tunnel (with a cave for each PUMP's local functions)
  • one of PUMP O's links connects to PUMP 1 and the other to PUMP 2 PUMP 1 and PUMP 1 also connect to each other
  • the number of PUMPs is limited primarily by requirements of the type of bus chosen (e g , HT is limited to 31 devices)
  • Each PUMP also emulates at least one HT tunnel (with a cave for PUMP functions) between a pair of HT links connected to pairs of corresponding processors (represented here as a pair of MASTERS and multiple pairs of SLAVEs)
  • PUMP 0 is shown with only one pair of processor-to-PUMP HT links for the MASTER processors (plus a ClearSpeed ClearConnect Bus interface), whereas the other PUMPs are depicted with two pairs of processor-to-PUMP HT links (two pairs of SLAVEs per PUMP are hoped for)
  • these processor-to-PUMP HT links are 16-b ⁇ t, but could be 8-b ⁇ t if FPGA I/O pins are insufficient (16-bit processor-to-PUMP HT links are more important to PUMP 0 than to the other PUMPs)
  • one of the processors could be replaced by a specialized processor adhering to the processor's bus protocol Examples would include replacing an AMD Opteron processor with a DRC or XtremeData coprocessor Such a coprocessor would have the same access to memory as the replaced processor
  • PUMP 0 is associated with the MASTER processors, which are responsible for module initialization, etc 8 CHARM - Compressed Hierarchical Associative & Relational Memory
  • a SCRAM node is composed of 1 to 4 fully connected quadrants, each quadrant contains 4 lobes and controls up to 8 optional "blades" (discussed elsewhere), in any combination, and each blade is fully connected to each lobe in the corresponding quadrant
  • Each lobe comprises a number of means whose conceptual interaction is depicted above
  • one or more of the blocks depicted above as (optional) "blades” also are implemented internally ( ⁇ e , within a lobe) in a non- bladed manner, so that the specific means are also built into the lobe and provide the corresponding capability inherently
  • a MASTER CPU or SMP O works cooperatively and symbiotically with a MASTER PUMP ⁇ via a direct communication path (for example, HyperTransport)
  • the MASTER CPU or SMP O typically has a multiplicity of volatile DRAM memory channels (preferably SECDED or Chipkill ECC) that it makes partly available to the MASTER PUMP ⁇ (e g , some memory is set aside for local use, and some is allocated to the MASTER PUMP ⁇ )
  • the MASTER PUMP & has a multiplicity of non-volatile, high-reliability low-power DRAM memory channels that it makes partly available (e g , as a block device) to the MASTER CPU or SMP O
  • a MASTER CPU or SMP O works cooperatively and symbiotically with one or more SLAVE PUMPs ⁇ via a direct communication path (for example, HyperTransport)
  • Each SLAVE PUMP ⁇ typically has a multiplicity of SLAVE CPU/SMP devices O associated with it, and that are entirely dependent on the SLAVE PUMP ⁇ for all non-local input/output
  • each SLAVE PUMP ⁇ emulates any devices required to bootstrap each of its dependent SLAVE CPU/SMP devices O, and well as all communications and storage devices, so that all aspects of the software execution environment for the SLAVE CPU/SMP devices O are under control of the SLAVE PUMP ⁇ , which is acting on behalf of the cooperative pairing of MASTER CPU or SMP O and MASTER PUMP ⁇
  • Each SLAVE CPU/SMP device O typically has a multiplicity of volatile DRAM memory channels (preferably SEC-DED or Chipkill ECC) that is entirely available to the SLAVE PUMP ⁇ , which is
  • one or more of the means collectively labeled here as “Blades” also are implemented internally ( ⁇ e , within a lobe) in a non-bladed manner, so that the specific means are also built into the lobe and provide the corresponding capability inherently
  • HyperTransport is used to create a system bus for connecting a MASTER CPU or SMP O (comprising at least 3 HT links) with a MASTER PUMP ⁇ and two SLAVE PUMPs ⁇
  • the MASTER CPU or SMP O is responsible for initializing the various HT devices it can reach, up to any non-transparent bridges, as well as any bridged non-HT devices such may be attached to or reached via the PEERS fabrics ⁇ , up to any non-transparent bridges
  • the MASTER CPU or SMP O has a multiplicity of volatile DRAM memory channels (preferably with SECDED or Chipkill ECC) that can be accessed by the PUMP devices ⁇ and ⁇ as appropriate, via the HT links
  • the MASTER PUMP ⁇ has a multiplicity of non-volatile, high-reliability low-power DRAM memory channels that it makes partly available (e g , as a block device) via any of its HT links (in a preferred embodiment,
  • various combinations of single or multiple MASTER CPU or SMPs O, MASTER PUMPs ⁇ , SLAVE PUMPs ⁇ , and HT-b ⁇ dges (which are implicit in the PEERS fabrics ⁇ and "Blades" ⁇ , due to the presence of PCI Express), up to the maximum number of addressable HT devices (note that, by design, the number of SLAVE CPU/SMP devices O is not included in this count), can be coupled together in a double-ended daisy-chain fashion
  • the HT connection from MASTER PUMP ⁇ to CRAY SeaStar ⁇ is exemplary of a connection to a vendor-specific interface, in this case to any of a family of CRAY- specific communications chips (SeaStar, SeaStar2, etc ) designed to provide high-performance communications between components of a supercomputing system Note that the HT connection to CRAY SeaStar ⁇ (or some other bridged interface) could alternatively originate at either (or both) of the SLAVE PUMPs ⁇ rather
  • a MASTER CPU or SMP O works cooperatively and symbiotically with one or more SLAVE PUMPs ⁇ via its direct HT paths, for the primary purpose of implementing high-performance computing cluster using "slaved" commodity processors (e g , the SLAVE CPU/SMP devices O), which need not be homogeneous
  • Each of the SLAVE CPU/SMP devices O typically has (or in any case, is required to have) only a single HT link, if any (and if none, then is interfaced to a device that can supply at least one)
  • a key aspect of the SLAVE PUMP ⁇ is its ability to directly interface to a multiplicity of HT devices having only singleton HT links, and to provide communications with and among them and with other devices in the system as authorized, without requiring the attached devices have multiple HT links of their own (which is normally required for HT-based multiprocessor communication)
  • the SLAVE PUMPs ⁇ implement external interfaces comprising a 16-bit HT tunnel pair, and at least five 16-btt non-transparent bridged HT device ports (which can either be internally switched, such as with a crossbar switch, or implemented internally as a set of connected HT tunnels, where each non-transparent HT device is simply an HT bridge with a tunnel, and the tunnels are connected in series)
  • the SLAVE PUMP ⁇ also implements an HT cave with its own functionality comprising logic, internal local memory, and input/output queues, all operating under the auspices of the cooperative pairing of MASTER CPU or SMP O and MASTER PUMP ⁇
  • any internal HT tunnels and paths within the SLAVE PUMPs ⁇ are of maximum width (as of this writing, the HT standard specifies a maximum width of 32 bits)
  • the MASTER PUMP ⁇ implements external interfaces comprising a 16-bit HT tunnel pair, and at least one 16-b ⁇ t bridged HT device port that can be configured as either a transparent or non-transparent bridge, depending on the intended use
  • the MASTER PUMP ⁇ also implements an HT cave with its own functionality comprising logic, internal local memory, external local memory, and input/output queues, all operating under the auspices of its cooperative pairing with the MASTER CPU or SMP O
  • its adaptive transparent/non-transparent bridged HT device port can be alternatively implemented as two 8-b ⁇ t adaptive transparent/non-transparent bridged HT device ports
  • any internal HT tunnels and paths within the MASTER PUMP ⁇ are of maximum width (as of this writing, the HT standard specifies a maximum width of 32 bits)
  • the BOSS/PUMPO and MASTER/PUMP ⁇ pairings are implemented via a single CPU handling the BOSS & MASTER functionality, and a single FPGA or Structured ASIC handling both their respective PUMP functionalities
  • the SLAVE/PUMP ⁇ pairings are each implemented via a single CPU handling the SLAVE functionality and a single FPGA or Structured ASIC handling the corresponding PUMP functionality
  • Each processor (each is designated here as either a MASTER or SLAVE) has a relatively small, but dedicated, "local” memory (shown as a pair of red bars above) that is closely matched to the needs of the specific processor For an Opteron processor, a pair of dual-channel DIMMs well-matched to the clock rate would be anticipated
  • Each PUMP has nine (9) parallel 72-b ⁇ t ECC-protected memory channels, typically implemented with commodity DIMMs that lie at the economic sweet spot of performance, capacity, and price
  • each PUMP supports at least a second bank, for a total of 18 DIMMs accessible 9 at a time
  • the 72 bits from each of the 9 DIMMs is corrected to 8 bytes (64 bits) of data per DIMM
  • 64 bytes is both the size of an Opteron cache line and the minimum payload in an HT packet
  • the internal PUMP buffers must handle blocks of at least 648 bytes (8 accesses of 9 channels at 9
  • Each PUMP can deliver a 64-byte cache line refill in a single memory access, using a single minimal HT packet
  • a multi-line cache refill can require the same number of accesses as cache lines, which provides a 4x throughput improvement over a dual-channel memory configuration - plus another dimension of error correction - and the data still fits in a single HT packet Since the processors mostly operate out of their own local dual-channel memory banks, there is a surplus of memory bandwidth, and this is used to support FPGA-based accelerators within the PUMP as well as accelerators connected to the PUMP (such as the ClearSpeed chips, e g , CSX600, that connect via the ClearConnect Bus)
  • Each PUMP also has several banks of 72-b ⁇ t ECC-protected NVRAM (shown above as an array of red dots), some of which are likely be implemented on stacked daughterboards attached to the CHARM module via stacking connectors (1 or 2 banks per stack)
  • the appropriate buffer sizes depend on the typical currently available chips, which are currently limited to on the order of 128K x 8 each, so each bank would have a capacity on the order of 1 MB
  • each processor's local memory (shown as a pair of red bars above) can be implemented with components that trade off density for speed, while optimizing for the economic "sweet spot " From the perspective of a processor's memory controller and/or MMU, this local memory is the processor's "main memory” (or, in the case of the Opteron processor, a slice of it), and it can serve this purpose well as long as it is significantly larger than the largest cache supporting it
  • the PUMP's NVRAM is a somewhat limited resource intended primarily for internal use by the PUMP logic to maintain metadata related to flash and disk-based storage, and to buffer critical data until it can be safely distributed and stored in a higher capacity distributed memory
  • the FPGA-based associative memory algorithm logic (including the CHARM FASTpage logic) resides in each PUMP
  • the FASTpage logic uses NVRAM to persistently maintain the ternary search tree meta data and buffer updates to search tree memory pages before they're written to flash memory (flash is the primary storage media for persistent associative memory)
  • One bit of a CHARM object id is used to determine whether or not an object is immutable ( ⁇ e , not mutable) An immutable object is one whose content cannot be changed (for any reason)
  • mutuable object is essentially an incomplete work in progress -- once the work has been completed, the object becomes immutable
  • transactional contributions from distributed processes building a search tree would all be targeted to the immutable object id of the tree being constructed (the targeted immutable object id would be known to the distributed processes)
  • Mutable object ids are NOT reusable After an object has become immutable, the associated mutable object id is no longer valid (except in an audit context), and its use would be automatically detected as a security issue
  • transient means “temporary” and persistent means “permanent,” and these terms may be used interchangeably
  • Transient ids are used in two different contexts
  • Transient ids are ultimately reusable, but only after their previous use has been verifiably flushed from the system Active transient ids are managed (issued and revoked) in large-ish blocks to minimize overhead
  • CHARM cannot allow them to be stored in persistent storage, including NVRAM
  • all decrypted objects fall into this category (no object can be stored in the clear, ever) - which means that a first-pass deletion of all in-the-clear objects can occur instantaneously by de-powering ( ⁇ e , removing the power from) the volatile RAM where they're temporarily stored *
  • a bidirectional mapping is maintained such that given either the Persistent Immutable object id or the message digest, the other can be determined
  • a cryptographic message digest (its dna_tag) is computed for the object to be stored (which may or may not be in the clear), and then the object is compressed and encrypted, at which point a second, outer" cryptographic message digest is computed for the encrypted result and concatenated to the encrypted result, which concatenation then becomes the basis for the FEC encoding process described below, by treating it as the "original data” and dividing it k ways, i e , into k slices
  • reconstruction of this "original data” from its k slices (or their FEC-encoded siblings) does not actually make the original in-the-clear data available, but only a compressed, encrypted (and therefore still secure) isomorphism of it
  • Successful reconstruction of the still-encrypted data from its k slices can be verified only if the key
  • the original k slices of data are used to encode up to n redundant "slices," any k of which is sufficient to reconstruct the original data Since a systematic (n,k) code is used, the original k slices (which are included in the set of n slices) are also sufficient to reconstruct the original data
  • no more than one slice of data from a single object is allowed to be stored on a particular device, precluding the loss of more than one of an object's slices due to a single device failure or theft Note that multiple slices (each from a different object) may be stored on each device, without penalty
  • any node or device entrusted with a slice can further encode the data for sub- dist ⁇ bution using an (n',k) FEC code, thereby encoding up to n' redundant "slivers," any /c' of which is sufficient to reconstruct the original slice Slivers are particularly useful for highly distributed "nearline” storage, such as when data is distributed to non-trusted devices (PC-based storage, servers, or any SERVANT, or device running or emulating a SERVANT)
  • PC-based storage, servers, or any SERVANT or device running or emulating a SERVANT
  • no more than one sliver of data from a single object is allowed to be stored in a single data-handling unit (e g , a disk sector, file, or record, etc ), precluding the loss of more than one of an object's slivers due to localized hard or soft failures (e g , bad disk sector, corrupted file, etc )
  • a single data-handling unit e g , a disk sector, file, or record, etc
  • the number of slices or slivers that can be co-located in the same facility and/or within a geographic region is constrained to be less than some policy-specified threshold
  • the capacity required on a particular device, or at a particular facility is only a fraction of what would be required to store an arbitrary object
  • SHADOWS nearline activities can include co-locating slices of
  • CHARM uses FEC (forward error correction), and specifically systematic codes such Reed-Solomon (RS), Cauchy Reed-Solomon (CRS), and/or others, with encoders and decoders implemented in software and/or hardware
  • FEC forward error correction
  • RS Reed-Solomon
  • CRS Cauchy Reed-Solomon
  • encoders and decoders implemented in software and/or hardware
  • a Reed-Solomon code variant (or similar)
  • n packets are generated and any k of them can enable reconstruction of the original data
  • the code can support up to r "erasures" (missing packets where it is known which packets are missing), or r/2 errors (where some packets are in error, but which ones they are is unknown)
  • CHARM storage-based data and data-in-transit tend to require only erasure correction (because there are other means for identifying missing or corrupted packets), whereas volatile memory-based data tend to require full error detection
  • FEC is applied in multiple dimensions, with different parameters, as appropriate to the purpose and nature of the data, which may differ, for example, among the various contexts associated with local vs distributed data, transient vs persistent data, mutable vs immutable data, data in transit vs data at rest, public data vs private data, etc Other considerations and requirements may apply as well, such as requisite levels of security, integrity, availability, persistence, survivability, etc
  • the remaining 256-k packets (or n-k packets, in general) contain redundant data in accordance with the selected FEC algorithm If any of the first k packets are missing, any packets from the remaining 256-k may be substituted, but FEC decoding must occur in order to reconstruct the original data
  • a key associated with (but, in a preferred embodiment, distinct from) the data's underlying security key can be used to seed a PRNG
  • the ordinal positions of the k original packets can be determined directly, allowing aggregation without the overhead of FEC decoding (assuming all the original k packets are available)
  • the same technique can be applied to group the remaining 256-k packets according to accessibility (locale, storage level, etc ) For example, the next most-accessible m packets can be distributed among the remaining 256-k packets in exactly the same way
  • each number is a transreal value, and thus, in addition to the set of integers and real numbers, also includes +/- infinity and NULL
  • the implementation is further enhanced to include a small set of signed, infinite precision, symbolic constants such as ⁇ (pi) and e, along with a small set of others, plus a means for referring to an extended set of mathematical and/or scientific constants (typically irrational numbers ' ) which are generally known in the art
  • the CHARM infinite precision floating point representation requires variable length word size on 1-byte boundaries, with at least an 8-b ⁇ t word (1 byte), which may be represented as having bits numbered from 0 to 7, left to right
  • the first bit is the sign bit, 'S'
  • the second bit is the exponent extension flag, 'e'
  • the third bit is the exponent bit
  • 'E' the fourth bit is the fraction extension flag, 'f
  • the final four bits are the fraction bits, 'FFFF
  • the value 'V represented by the word may be determined as follows:
  • V (the ratio of integers specified by the following number pair)
  • V (the value of the dereferenced OID)
  • V (the value of the dereferenced BLOB descriptor and content)
  • V (+) (the extended constant indexed by the following integer)
  • V (-) (the extended constant indexed by the following integer)
  • Rational numbers can be represented by a ratio of integeis (i e , a pair of numbeis coi responding to a numeiator and a denominaloi)
  • Rational_Number_Indicator byle indicates that a pair ol numbeis tollows, each of which is a variable-length signed integer 1"
  • An OID is a variable-length value in a format somewhat similar to the numeric formal
  • the fc (exponent) field is replace by a 1 (type) field, and the
  • a BLOB desciiptoi is a variable-length value in a format somewhat similar to the numeric foimat (followed by the described content)
  • the E (exponent) field is ieplace by a T (type) field, and the F (fractional) field is replaced by a C (content) field
  • Specific field values aie also different
  • the IEEE double precision floating point standard representation requires a 64-bit word, which may be represented as numbered from 0 to 63, left to right The first bit is the sign bit, S, the next eleven bits are the exponent bits, 'E', and the final 52 bits are the fraction 'F'
  • the value V represented by the word may be determined as follows
  • V NaN ( Not a number)
  • CHARM and particularly, PUMP and CORE, can include Perspex processing within its native capabilities
  • a Perspex matrix is a matrix of 16 numbers arranged in 4 rows and 4 columns that can be handled as a single operand, with a minimum of 16 bytes (1 byte per number)
  • each number in a Perspex matrix is a transreal value, and thus, in addition to the set of integers and real numbers, must include +/- infinity and nullity (NULL in CHARM)
  • the CHARM implementation is further enhanced to include a small set of signed, infinite precision, symbolic constants such as ⁇ (pi) and e, along with a small set of others, plus a means for referring to an extended set of constants
  • ⁇ (pi) and e a small set of others
  • a unique code value is assigned to every unique textual word occurring within selected lexicons anywhere in the system, and the code assigned is determined by word length and frequency of occurrence
  • word phrases are similarly assigned a code value
  • word codes and phrase codes are used to maximize internal compression and throughput 8.1.5.2 BASIC CONCEPTS
  • Canonical entries ( ⁇ e , lowercase words) are defined by their code, the associated text string, and a list of non-canonical entries
  • Non-canonical entries ( ⁇ e , mixed-case and uppercase words) are defined by their code, the code of the associated canonical entry, and a variable-length bit pattern (LEB128) that defines which characters need to be uppercase (one case bit per text character)
  • the figure above depicts a specific dual-switching-fab ⁇ c implementation of PEERS, as would be contained within 1 lobe from 1 quadrant of a SCRAM node (there are 4 such lobes in each quadrant)
  • the MASTER and BOSS blocks depicted here are intended only to indicate PEERS connectivity, and are external to PEERS itself
  • each lobe has a PEERS switching & routing fabric, but in a preferred embodiment there are actually several redundant fabrics working together in an active/active configuration
  • each lobe has at least a dual fabric working in conjunction with other lobes and quadrants that also have at least a dual fabric
  • HyperTransport is used to connect the processor/PUMP configuration to each PEERS fabric, which is primarily PCI Express (PCIe) to simplify off-board switching and routing of input/output (IO), so the primary connection to PEERS is via one or more HyperTransport Bridge/Tunnel interface chips O 1 which connect directly to the processor(s) ⁇ and PUMP(s) ⁇
  • PCIe PCI Express
  • IO input/output
  • HyperTransport Bridge/Tunnel interface chips O 1 which connect directly to the processor(s) ⁇ and PUMP(s) ⁇
  • Each HyperTransport interface is bidirectional and also double-ended (allowing configuration and control from either end)
  • a representative pair of nVidia n3600 chips O is daisy-chained to achieve a sufficient number of bridged HyperTransport-to-PCIe connections (56 lanes) for each fabric
  • the particular chip pair also includes additional IO (12 SATA lanes, 4 GBE ports, and 20 USB ports) that improve the economics of the system by being available "for free” and thereby eliminating the need for some chips that may otherwise be needed
  • the 12 SATA lanes from each fabric are distributed (not shown here) to the 8 Outrigger Blade positions as follows 4 distinguished blade positions each receive a set of 4 lanes ( ⁇ e , from each of 4 lobes, for a total of 16 lanes per distinguished blade position), and all 8 blade positions each receive 1 lane ( ⁇ e , from each of 4 lobes, for a total of 4 lanes in any blade position)
  • Each of the 4 GBE ports is connected to a separate 5-port GBE switch chip (not shown), with 1 such chip for each of the 4 lobes in a quadrant, leaving 1 unused switch port which is then provided as an external port This allows all 4 lobes in a quadrant to invisibly monitor and share the load for each external GBE port, and each quadrant externalizes 4 such ports (1 per lobe)
  • the 20 USB ports on each pair of nVidia n3600 chips O are distributed to a set of 20 internal USB connectors that provide low-latency access to (cheap) flash-based storage
  • Up to 4 PCIe/PCI-X bridges O (e g , PEX 8114) account for 8 PCIe lanes from the pair of nVidia n3600 chips O in each fabric
  • Each of the bridges O supports 4 NEC ⁇ PD720101 USB 2 0 host controllers ⁇ , each of which provides a root hub with 5 downstream-facing ports, for a total of 80 additional USB ports per fabric, or 160 additional USB ports per lobe
  • each set of 80 additional USB ports would typically be on a separate PCB would could be optionally omitted from a particular configuration (say, to achieve cost savings)
  • the CHARM technology stores only fractional, compressed, encrypted, FEC-encoded data on each flash drive and disk drive, using multiple lobes and quadrants (plus external nodes) to distribute the information
  • PCIe lanes from the pair of nVidia n3600 chips O 8 from each chip
  • a PCIe switch ⁇ e g , PEX 8548
  • e g , PEX 8548
  • a reserve of 8 additional PCIe lanes is provided and may be flexibly configured as needed using 1 to 4 ports (e g , 8x1 , 4x2, 4x1+2x2, 2x4, 4x1+1x3, 1x4, etc )
  • PCIe lanes from 1 chip in the pair of nVidia n3600 chips O are connected to a PCIe switch ⁇ (e g , PEX 8548) whose purpose is to provide a 4-lane communications path to the 4 other lobes in each of the 2 other quadrants (for a total of 8 lobes)
  • PCIe switch ⁇ for each of the lobe's switch fabrics
  • each fabric of a preferred embodiment 16 bidirectional PCIe lanes from 1 chip in the pair of nVidia n3600 chips O (i e , all 16 lanes from 1 chip) are connected to a PCIe switch O (e g , PEX 8548) whose purpose is to provide a 4-lane communications path to each of the 8 Outrigger Blades that share the same quadrant the as the lobe under discussion There is 1 such PCIe switch ⁇ for each of the lobe's switch fabrics
  • each of the 8 Outrigger Blades is therefore directly connected to each of the 4 lobes in the same quadrant via a 4-lane communications path on each of the lobe's switch fabrics
  • each blade has 16 lanes (4 lanes to each lobe) on each of 2 fabrics, for a total of 32 bidirectional lanes
  • readily available parts allow for a rate of 2 5 Gbps in each direction, per lane, aggregating to a 32-
  • HyperTransport is used to connect the processor/PUMP configuration ⁇ and ⁇ in each lobe to at least 1 CRAY Seastar (or Seastar2, etc ) chip ⁇ , which can be used to interconnect any number of SCRAM nodes with each other, and/or with other CRAY systems
  • CRAY Seastar or Seastar2, etc
  • a minimum of 4 SeaStar-based interfaces is available per quadrant, and each one offers 6 high-speed links with a sustained bidirectional throughput of 6 GB/second (by comparison, a fast FC link - such as a SAN interface - is 4 Gbps, which is more than an order of magnitude slower)
  • each line depicted between the blocks represents either 4 or 8 aggregated links
  • the design is intentionally partitioned in such a way that adding modules to the system also increases the capacity for communication among the modules This is accomplished by ensuring that the necessary fraction of the total capacity is directly available on each module, in contrast with the usual practice of placing the switching fabric on its own modules
  • the wiring of the switch fabric is more complex (especially between the "North” and “South” switches) than would be otherwise necessary, in the preferred embodiment the complexity can be relegated to an entirely passive wiring harness, flex circuit, or PCB
  • the HT-2100 has 24 PCIe links with support for up to 5 PCIe controllers, and can thus offer up to 5 independent ports, of which only 4 are needed in the preferred embodiment Two controllers would each aggregate 8 PCIe links and two would each aggregate 4 PCIe links, using a total of 24 links (out of 24 possible), but only 4 of the 5 available controllers
  • the 4 interfaces labeled A 1 B 1 C 1 D above correspond to the 4 PCIe controllers to be used (such as 4 of the 5 available on an HT-2100), and would each comprise either 4 or 8 aggregated PCIe links Each of the 4 controllers would be connected to a different switch fabric chip (all 4 or 8 of the links assigned to a particular PCIe controller would connect to the same switch fabric chip)
  • the HT-2100 has two HyperTransport ports (16x) with an integrated tunnel
  • the HT-2100 would be interposed on the HyperTransport interface between two Scrutiny PUMP devices, or alternatively, between a MASTER and a PUMP, or between two MASTERS
  • the number of interfaces (and lanes per interface) is highly dependent up the particular combination of HyperTransport-to-PCIe bridge chips used
  • the HT-2000 has two HyperTransport ports (16x and 8x) with an integrated tunnel Given a homogeneous combination of HT-2000 chips, for example, its 16x HyperTransport port would be connected upstream to the processor array, and downstream to an optional HyperTransport device (not shown)
  • the HT-2000 also has 17 PCIe lanes with support for up to 4 controllers
  • the 4 controllers would each aggregate 4 PCIe lanes, for a total of 16 lanes used
  • the 4 interfaces labeled A 1 B 1 C 1 D above would have its own PCIe controller and each interface would comprise 4 aggregated PCIe lanes
  • Each of the 4 PCIe controllers would be connected to a different PCIe switch fabric chip (each set of 4 lanes would connect to the same switch fabric chip), as depicted
  • IOPS Performance @ 1 KB When using smaller I/O sizes the number of IOPS increases accordingly For example, a normal SHADOWS FASTpage index access is 1 KB, not 4KB, so the number of IOPS per quadrant is increased by 4X, to 6.4 million IOPS per quadrant Furthermore, given a "better" USB flash (say, for example, 23 MBps vs 10 MBps, assuming 80% reads and 20% writes), this could improve further by a factor of 2 3X 1 to more than 14 million IOPS per quadrant (Note that some of the best drives claims speeds of up to 34 MBps read and 21 MBps write)
  • Affordability A single quadrant fully populated with 640 cheap 2GB flash drives ($15 each) would yield more than 1.2 TB of flash memory performing 1.6 million IOPS for under $10,000 If high-performance 4GB drives were used instead (about $30 each), a single quadrant would yield 2.5 TB of flash for under $20,000 However, the better drives are also at least twice as fast (23 MBps vs 10 MBps, assuming 80% reads and 20% writes), so the performance would double to 3.2 million IOPS for the same $20,000.
  • the duty cycle approaches k/n, where the storage is FEC-encoded with an (n,k) erasure code is used to establish redundancy, and where the survival of any k of the n fragments is sufficient to guarantee successful retrieval Nearline storage is maintained on a spun-down basis
  • SHADOWS NEARd ⁇ ves are used only for nearline storage, and contain only immutable objects (and more precisely, only fragments of objects)
  • Storage drives are preferably dual-ported, or connect to a dual-ported multiplexer
  • Non-active drives are spun up occasionally for data retrieval, if the information cannot be retrieved from the two active drives
  • the active drives cache mutually exclusive entries (objects) that already exist on the non-active drives
  • objects that already exist on the non-active drives
  • this is complemented by the presence of flash-based (or equivalent) caching of high-demand objects, also on a mutually exclusive basis
  • Mutual exclusivity maximizes the caching effect - preventing redundant caching enables more objects to be cached in a given amount of space
  • the active drives redundantly cache all objects that do not exist on the non-active drives Such objects are also cached in higher levels of memory and storage, at least until safely distributed and stored in accordance with storage policy (and any applicable SLAs)
  • the active drives Periodically (and with a fairly long period), the active drives are rotated (one at a time), so that, except initially, there is always an old one and a new one
  • the long period minimizes start/stop cycles, and the rotation levels the wear and MTTF
  • NEARdrive embodiments Three primary preferred NEARdrive embodiments are envisioned as most useful, a small, vacuum-sealed steel can, a "full-sized” storage “blade” or module, and a “Mini” version that has approximately the same dimensions as a single full-sized 3 5- ⁇ nch disk drive
  • NEARdrive Can implementations for the following storage configurations would be typical
  • Each NEARdrive Can has the same form factor, with the approximate dimensions of 3 25" diameter x 6"H (12"H for double-height can)
  • the standard height can is sufficient to accommodate at least 4 small form factor (SFF) drives (2 5" or smaller), plus 2 smaller 1 8" drives (optional)
  • each NEARdrive Can communications interface comprises a single 12-lane data connector that can support any combination of single-ported or dual-ported drives (up to 300 Gbps each in the current state of the practice, but this is not limited by the invention) requiring 12 ports or less
  • the number of lanes supported is arbitrary and can be reduced or increased as necessary, of course
  • one of the goals is to maximize the data throughput and IOPS during peak periods, which requires a minimum of 1 lane per physical drive (2 for dual-ported drives)
  • the number of connector lanes is reduced to 1 or 2, and 1 or 2 multiplexers are embedded inside the NEARdrive Can
  • the multiplexers allow switching between the contained disk drives based on software control mechanisms (SATA or SAS protocols)
  • Each NEARdrive Can requires an upstream host adapter channel for each lane in the data connector
  • Each NEARdrive Blade may optionally configured to be “intelligent,” with its own NEARdrive controller and switching logic, which case it has its own local MASTER (and possibly includes CHARM processing logic), or it may be “switched only,” in which case it operates under the control of a nearby MASTER
  • NEARdrive Blade implementations for the following storage configurations would be typical
  • Each NEARdrive Blade has the same form factor, with the approximate dimensions of 7"H x 2 5"W (thick) x 9"D This is sufficient to accommodate up to four full-sized (3 5 inch) drives or at least 16 small form factor (SFF) drives (2 5" or smaller)
  • each NEARdrive Blade may be single-ported or dual- ported (up to 300 Gbps each in the current state of the practice, but this is not limited by the invention)
  • Each NEARdrive Blade includes redundant SAS controllers, such that dual-ported drives are connected to independent controllers and switches within the blade
  • Each NEARdrive Blade may optionally configured to be “intelligent,” with its own NEARdrive controller and switching logic, which case it has its own local MASTER (and possibly includes CHARM processing logic), or it may be “switched only,” in which case it operates under the control of a nearby MASTER
  • Each NEARdrive Mini typically comprises a NEARdrive controller, and either four matching 2 5- ⁇ nch SAS/SATA drives or eight matching smaller SAS/SATA drives With eight smaller drives, a hybrid configuration such as 4 SAS and 4 SATA drives is possible
  • Each NEARdrive Mini has a conceptual or actual "buddy" (which may not be available)
  • Each NEARdrive Mini is responsible for its own 4 or 8 drives, but can, in a preferred embodiment, directly access the 4 or 8 drives of its buddy
  • the NEAR technology can be integrated with the SHADOWS RUBE 1 technology (described elsewhere), so that a working fluid is actively circulated through the NEARdrive (Blade or Mini) and among its components, in order to provide thermal stabilization and thereby minimize thermal stress
  • the boiling point of the working fluid is 34°C (at STP, and slightly higher at mildly elevated pressures, making the effective range approximately 34°C to 40 0 C)
  • the RUBE technology supplies an appropriate combination of liquid and vapor according to how much cooling or heating is needed
  • the NEARdrive components approach or exceed a target temperature, they act as a heat source and are efficiently cooled by the fluid as it changes phases, on the other hand, if the components drop below this temperature (such as when they are spun down), they act as a heat sink and are efficiently kept heated by the fluid as it phase-changes the other direction - thus, the fluid greatly increases thermal stability When drives are spun up after a period of non-use, they are already warmed up and ready to go without thermal stress
  • the NEARd ⁇ ve analysis of SMART data available from each disk drive focuses first on indicators whose "critical threshold" values were established with high confidence (>95%) by the Google study, as summarized in Table 8.5.6-1
  • CMU researchers analyzed the data from a 765-node supercomputer cluster with 3,060 CPUs, 3,060 DIMMs, 765 motherboards, 3,406 disk drives, and other components - over a 5-year period (the useful life of a disk drive)
  • the CMU researchers were able to draw a number of important conclusions (which are applied in a novel way by the NEEAR technology, as described in section 8 5 8)
  • Disk drive failures exhibit significant levels of autocorrelation and long-range dependence, so their statistical properties do not form a Poisson process as is commonly assumed
  • the failure rate in one time interval is predictive of the failure rate in the following time interval
  • Disk drive failures are not realistically modeled by an exponential distribution as is commonly assumed, but rather, are characterized by higher levels of variability and decreasing hazard rates (the empirical distributions are fit well by a Weibull distribution with shape parameter less than 1)
  • the decreasing hazard rate function predicts that the expected remaining time until the next failure grows with the time since the last failure " It is observed, for example, that right after a failure, the expected time until the next failure is around 4 days After surviving for 10 days without failures, the expected remaining time until the next failure grows from 4 days initially to 10 days After surviving a total of 20 days without failures, the expected time until the next failure grows to 15 days
  • the highly predictive non-zero SMART data counts referred to in section 8 5 6 and inferences from the findings summarized in section 8 5 7 both mandate and enable direct preventive action (rather than only remedial action) for the corresponding drive, in advance of drive failure
  • a drive can be (and should be) treated exactly as if it has just encountered an actual failure, with the exception the drive may not have actually failed yet, and such actual failure may in fact be preventable, or at least deferrable
  • a predicted or actual drive failure causes the data storage and retrieval responsibilities of the failed or "at- ⁇ sk" drive to be immediately shifted to other drives
  • load-shifting occurs for any failed drives first, and the relative risk among the at- ⁇ sk drives can be used to determine the load-shifting order among the at- ⁇ sk drives
  • the at- ⁇ sk drives may be prioritized by the relative risk apparently (but not actually) implied by their relative SMART Indicator values, with consideration given to other indicators and risk information that may be available
  • failed or at- ⁇ sk drives are left spun up, if possible, and then subjected to a pre- spin-down drive analysis and maintenance cycle as described in section 8 5 9
  • the drive is spun down the same as if no SMART errors or failure had been detected, and it is left in the normal drive rotation for later use
  • the drive is permanently de-powered and taken out of the normal drive rotation
  • the highly predictive non-zero SMART data counts referred to in section 8 5 6 can be used in conjunction with any observed failures to trigger an elevated risk level, by predicting the increased relative risk of disk drive failures, especially among those drives sharing one or more common dependencies (e g , same module, same thermal environment, same vibrational environment, same power environment, same EMP environment, etc )
  • this situation triggers a drive analysis and maintenance cycle for each of the drives in the collective group (e g , those in the affected drive rotation)
  • the drives in the collective group e g , those in the affected drive rotation
  • an automated pre-spin-down drive analysis and maintenance of every disk drive is executed on a periodic basis as described herein, and also executed when triggered in accordance with proactive risk management activity such as that described in 8 5 8
  • the aforementioned process can easily be combined with, or replaced by, other data recovery algorithms and processes known in the industry, as appropriate, in order to enhance the survivability of the data stored via the NEEAR technology, and without detracting from its most important properties, namely, the capability to recover from some number of otherwise unrecoverable errors, the capability to proactively prevent some number of errors that may otherwise be encountered, and the capability of accomplishing said error recovery and error prevention on a fully automated basis without encountering any unplanned downtime or interruption of service
  • a disk read error ( ⁇ e , one that has not yet caused the corresponding drive to be labeled as having failed) triggers an "on-the-fly" drive analysis and maintenance cycle that is limited to the immediate sector(s) corresponding to the read error, beginning with the errant sector, while also triggering the full drive analysis process as described in the previous section
  • the idea is to attempt to complete the current access, even if a delay is required Because every NEARd ⁇ ve stores only fragments of objects, and a sector error can affect at most one such fragment, the analysis-induced delay
  • the NFARdme recovery analysis is limited by the sensitivity and precision of the drive elcctionics, and thus, under normal circumstances, can recover only the pre ⁇ ious layer cannot add any latency to the overall storage/retrieval operation, except in the case where it corresponds to the "swing vote" ( ⁇ e , an extremely unlikely scenario involving multiple failures, where no other fragments are available to help complete the operation)
  • FACTUAL capability is a "memoization" system designed to operate at global scale and supercomputing speed, with the high levels of security and survivability commensurate with the SHADOWS infrastructure "Memoization” is essentially the capability of looking up known results of deterministic processes and/or functions rather than recomputing them from scratch
  • each SHADOWS artifact and each process has its own identity, whenever a deterministic process or function accepts a particular set of input values and produces and deterministic set of output values, we can treat the set of input values and the specific process identity as a new artifact having an identity of its own We can likewise treat the set of output values as an artifact, with an identity "Memoization” then becomes a conceptually simple matter of establishing a "pairing" between the input/process identity and the output identity, such that any already-known output can be looked up and identified Thus, given any input/process identity, it can be determined (through a lookup) whether the result has been previously computed, and if so, what its identity is
  • a SHADOWS native process has an input queue and an output queue, as depicted above, on the left Ignoring security issues, the input queue accepts tuples of the form ⁇ TxID, Operand ID List ⁇ , performs the work of the process which is to generate one or more Result IDs, then enqueue them for distribution
  • the transaction id (TxID) ties the Operand IDs (received as input) to the Result IDs associated with the processing results
  • the FACTUAL system works as depicted above, on the right, by intercepting the input queue O before it is seen by the aforementioned process ( ⁇ e , the one described earlier in section 7 3 1 , which is depicted above as a thumbnail illustration at ⁇ )
  • the input queue O accepts tuples of the form ⁇ TxID, Operand ID List ⁇
  • An interior lookup process immediately fires off requests ⁇ to find out if the current process (which must be deterministic) has previously computed a result for the Operand IDs, requesting that the results ( ⁇ e , a "hit” with results, or a "miss” with nothing) be sent it along to a specific destination team responsible for collecting results ⁇ (which may be the current team or some other one)
  • a message including the ⁇ TxID, Operand ID List ⁇ tuple is also sent to the specified "collect results" destination team ⁇ to put it on notice that data for the particular TxID may soon be arriving (unexpected data can trigger defensive behavior),
  • the team responsible for throttling the actual execution of the process receives the ⁇ TxID, Operand ID List ⁇ tuple, it queues immediately enqueues it internally to the co-located execution process input queue ⁇ If a "hit" message ⁇ occurs, the input queue ⁇ is checked If processing has not started, the entry is dequeued because the result is already known If the processing has already started, it is left to run, and its result can be verified against the known result In any case, the "hit" message ⁇ can be forwarded to the "Vet Results” process team ⁇ If the co-located execution completes, its results O are also forwarded to the "Vet Results” process team Validated results can be placed into the normal process output queue ⁇ If results were calculated that were previously not seen, they can be posted for update ⁇ , in order to support a future computation
  • Node A unit of reference in a data structure Also called a vertex in graphs and trees (2) A collection of information which must be kept at a single memory location
  • String A list of characters, usually implemented as an array Informally a word, phrase, sentence, etc Since text processing is so common, a special type with substring operations is often available Note
  • string usually refers to a small sequence of characters, such as a name or a sentence
  • text usually refers to a large sequence of characters, such as an article or a book
  • Tree (1 ) A data structure accessed beginning at the root node Each node (including the root) is either a leaf or an internal node An internal node has one or more child nodes and is called the parent of its child nodes All children of the same node are siblings Contrary to a physical tree, the root is usually depicted at the top of the structure, and the leaves are depicted at the bottom (2) A connected, undirected, acyclic graph It is rooted and ordered unless otherwise specified
  • Root The distinguished initial or fundamental node of a tree The only node which has no parent Parent.
  • the tree node conceptually above or closer to the root than a particular node (the child node) and which has a link to the child node
  • Leaf A node in a tree, but without any children In a FASTpage index, a leaf is also a "compound" node that may include both an external reference ( ⁇ e , the proper data of a "leaf) and also an internal reference ( ⁇ e , a "follow-on" reference to the next internal node) Note' Every node in a tree is either a leaf or an internal node. All FASTpage leaf nodes are allocated beginning with an ordinate of 255 and decrementing their ordinate value, and therefore exhibit stack-like growth behavior
  • FASTpageTM The idea behind FASTpageTM is to create a fast, highly scalable, associative memory mechanism that can adapt to the information to be remembered, in order to optimize both time and space
  • Each FASTpage implementation supports an arbitrary number of independent local search spaces, limited only by local storage capacity
  • Each FASTpage search space may be individually defined to be either transient or persistent, with individually specifiable survival requirements
  • FASTpageTM is a fast, efficient associative memory mechanism that can also persist indefinitely
  • persistence properties of a FASTpage index can be achieved any common data storage means, the concept is designed to capitalize on the very high performance of solid state disk (SSD) drives (and SSD-accelerated storage systems) in general, and the Scrutiny ® FIRE * and NEAR + technologies in particular
  • FASTpage indexes takes advantage of minimal NVRAM* resources and efficiently uses flash memory to amplify its high performance
  • flash memory can typically endure only a limited number of write-cycles is fully accommodated within the internal FASTpage mechanisms (which cannot approach such limits, by design), yet does not negatively impact FASTpage indexes in any way FASTpage indexes can also take full advantage of means that have no such limits
  • any FASTpage implementation can participate in higher level distributed architectures, such as SHADOWS and CHARM ⁇ (of which FASTpage is a component) Given a FASTpage implementation that is participating in such a distributed architecture, then each FASTpage search space can also participate, on an individually selectable basis, in a higher-level, distributed search space
  • FASTpage indexes can be implemented relatively easily in hardware or software, while avoiding the negative attributes of various traditional associative search mechanisms Nonetheless, the FASTpage mechanism was inspired by the respective individual benefits normally attributed to memory tries, binary search, binary trees, splay trees, ternary search trees, hash tables, distributed hash tables, and Bayer-tree variants " In particular, the ternary search tree (TST) serves as the conceptual jumping-off point for understanding the FASTpage search concepts It is assumed that the reader is generally familiar with the properties of these traditional search mechanisms
  • FASTpage indexes combine the properties of a ternary search tree (TST) and a digital search trie (spelled “Trie,” but pronounced “tree”), taking advantage of their ⁇ n-memory search performance, while adding persistence with a page-sized storage unit (typically a convenient multiple of a 512-byte sector)
  • TST concepts are central to FASTpage, especially the property that "not found" conditions in string searches are determined, on average, faster than equivalent searches with a hash table Also important is that, unlike hash tables, FASTpage requires no reorganization to accommodate growth
  • FIRE Fluorescence Index & Repository Emulator
  • FIREbladeTM or FIREdhveTM It provides high-performance all-electronic, long-term data storage that is immune to mechanical wear and vibration (including seismic events)
  • the stored data is safe from intruders even if stolen
  • the number of read/write accesses per second is orders of magnitude faster than hard disk drives
  • f NEAR Nearline Emulation & Archival Repository
  • NEAR Nearline Emulation & Archival Repository
  • NEAR Nearline Emulation & Archival Repository
  • NEARbladeTM or NEARdriveTM It provides high- capacity, electronically assisted long-term data storage that is subject to minimal mechanical risk (including wear, vibration, and seismic events), due to significantly reduced mechanical duty cycle
  • the stored data is safe from intruders even if stolen
  • the number of read and/or accesses per second is orders of magnitude raster than unassisted hard disk drives
  • NVRAM Non-Volatile Random Access Memory
  • NVRAM Non-Volatile Random Access Memory
  • ⁇ CHARM Compressed Hierarchical Associative Relational Memory
  • CHARM Counter-Healing Adaptive Distributed Organic Working Storage
  • SHADOWS Self-Healing Adaptive Distributed Organic Working Storage
  • FASTpage, CHARM, and SHADOWS are trademarks of Scrutiny, lnc "
  • B-trees The three most well-known variants of Bayer trees are commonly known as B-trees, B+trees, or Birees 3
  • a FASTpage TST requires only 4 bytes per internal node (1 byte for each of the four indices split, left, equal, right)
  • Each FASTpage page has sufficient page for exactly 256 nodes, so that each index can refer to any node
  • a typical FASTpage TST page requires only 1 KB of space of memory and/or storage
  • Trie algorithms are known for their extremely fast in-memory search speeds, but at the expense of explosive sparse memory requirements
  • the set of keys is sparse, i e when the actual keys form a small subset of the set of potential keys, as is very often the case, many (most) of the internal nodes in the Trie have only one child, which wastes memory FASTpage Trie pages usually start out as FASTpage TST pages, however, and these may be densely populated (each node requires only about 4 bytes on average, which is perhaps one-third the space of a classic TST node) When sufficient leading-byte diversity exists, the TST is converted to a Trie
  • a FASTpage TST page is converted to a FASTpage Trie page at the point when the leading byte of its first node corresponds to the lowest possible byte value (in the associated set of possible keys) and the number of keys with diverse but contiguously sequential leading bytes in the same node exceeds a threshold
  • Any nodes with non-contiguous leading bytes are allowed to remain in the Trie page, but are moved to their respective "proper" locations (based on their leading byte value)
  • FASTpage metadata can be embedded in each page, but in a preferred embodiment is stored elsewhere In particular, the metadata is maintained in a separate set of pageable, persistent storage (co-located with the corresponding FASTpage pages) that can be indexed by the FASTpage page number
  • Each FASTpage page has an associated metadata descriptor comprising a page type (e g , TST, Trie, etc ), page size (optional), node size (e g , 4 or 8 bytes), reference size (e g , 2, 4, 8, or 16 bytes), the most-significant portion held in "common" by the majority leaf node reference addresses, current indices for the stack and heap, access control & security barrier information, data validation information (optional), and one or more indicators of progress toward TST-to-T ⁇ e conversion
  • All of the FASTpage pages at a particular location may be subordinated to pages held elsewhere, such that the location itself (to an arbitrary level of detail) may comprise a portion of the actual search key
  • One effect of this is that of natural key partitioning
  • Another effect is storage space conservation by not having to store (or process) a portion of a key that is held in common by all the keys at a particular location
  • Leaf nodes are not fixed in size, but may consume 1 or 2 (or more) 4-byte entries, as required, in order to contain their variable-length external reference information (plus a 1-byte follow-on index to an internal node)
  • a single 4-byte entry contains a 1-byte internal index plus a 1-, 2-, or 3-byte external reference
  • a double 4-byte entry (8 bytes) contains a 1-byte internal index plus a A-, 5-, -6, or 7-byte external reference
  • a triple 4-byte entry (12 bytes) can contain up to an 11-byte external reference, and so on
  • numerical references are encoded with a variable-length unsigned LEB128 " binary number (1 bit per byte is a flag, with 7 bits per byte of numerical information, so each byte contributes a factor of 2 7 to the addressable range)
  • a FASTpage index is associative, it can serve wherever there is a need for an arbitrary n-to-m mapping (e g , 1-to-i , 1 -to-many, many-to-1 , many-to-many), which corresponds to a very large application space
  • n-to-m mapping e g , 1-to-i , 1 -to-many, many-to-1 , many-to-many
  • LEB128 is a relatively well known data format that refers to a 'Little Endian Base 128" integer with 128 possible values per byte
  • FASTpage index can be optimal, both in-memory and on disk
  • FASTpage keys are variable-length, and can be of any length, without penalty, so hierarchical file systems can be implemented without arbitrarily restricting the length of file names and directories (folders), and each directory (or folder) can contain any number of entries Because FASTpage keys are variable-length, without restriction, it is possible to implement path names of unlimited length, such that there is just a single index for an entire file system Nonetheless, a typical approach would be to implement a "nested" index where each directory (or folder) has its own secondary FASTpage index, because it offers a number of advantages (the discussion of which are outside the scope of this document)
  • a FASTpage index can be substituted wherever a traditional database index can be used, such as on an index-per-field and/or index-per-key basis, for each table With a FASTpage index, it is also quite reasonable to index EVERY field or column, rather than just a selected few, in every table, with a single database-level index
  • a single index can easily be used to subsume other indexes by prefixing each key value with both a table identifier and a field or column identifier* The table identifier, field/column identifier,
  • a FASTpage index is about the same speed as a hash table for a successful lookup but often much faster for an unsuccessful lookup, especially with long keys (this is important, because hash tables are often used to determine that a key is not present)
  • a FASTpage index can traverse key information in sorted order (forwards or backwards) and perform nearest match" searches Unlike a hash table, a FASTpage index never needs wholesale reorganization to account for growth
  • a FASTpage index maintains high-density key pages to optimize time and space, and becomes even more efficient over time as FASTpage TST pages are converted to FASTpage Trie pages
  • a key goal of disk-based indexes is to reduce the number of disk accesses required to locate a key, since disk access is usually a major performance bottleneck Accordingly, their disk-based index nodes tend to be "fat" and contain many keys, in order to minimum the number of fetches required Likewise, a FASTpage index contains many keys, but is much smaller and finer-grained by design, so as to be able to cache many more nodes that have a high probability of relevance Although a FASTpage index will potentially incur more disk accesses, one should expect less actual disk I/O overall, because of the higher probability that useful nodes are cached early on Also because flash memory is the primary FASTpage persistence mechanism (by design), in a preferred embodiment the FASTpage lookup rate in a very large database will easily exceed that of a typical disk-based index by as many as several orders of magnitude
  • each table identifier, and each field or column identifier would be a variable-length numeric value (typically only one byte) that is mapped to the corresponding table name, or field or column name, respectively
  • each and each unique value - is automatically factored out and stored only once
  • a multi-table search can be carried out easily, and tables not containing a particular field or column cannot contain the associated field or column identifier and thus can yield no relevant records
  • fields containing null values naturally occupy no space at all, and if none of the records in a table have a value for that field, even the identifier itself need not be stored (a search for that field, or values in that field, can yield no relevant records)
  • the key values associated with a particular field or column are defined to be UNIQUE (not duplicated)
  • the result of each successful record-oriented database index search is typically a record number or row number, for object-oriented databases, the result of each successful index search is typically an object
  • FASTpage indexes In addition to using one or more FASTpage indexes to replace traditional database indexes, they can also be used to achieve significant database compression (a technique used in Scrutiny's CHARM technology) The idea is to achieve compression by factoring out the "vocabulary" associated with a particular database
  • One way to achieve this is to create a non-duplicated index of key values comprising the vocabulary of the database, including at least all non-BLOB, non-numeric values, but possibly numeric values as well
  • the key value is prefixed by a data type identifier before index insertion (which means search keys need to be prefixed likewise)
  • a variable-length code e g , LEB128) is automatically assigned based on its predicted or actual likelihood of occurrence (frequency), such that the highest-frequency key values may be assigned the shortest codes " , and vice-versa
  • a reverse-mapping entry is also inserted Once the vocabulary is mapped, all database values can
  • a FASTpage index is well-suited to full-text indexing in general, and indexing of arbitrary content in particular, since its variable-length keys with leading compression provide a great deal of flexibility
  • Some of the same techniques described above for compressed databases are also directly applicable to content management, regardless of whether the content repository is a file system, a database management system, or something else
  • the vocabulary compression technique is particularly useful, since it also allows search keys to be mapped from vocabulary words to coded "concepts " Concept- coding and tagging can supplement simple text searching by incorporating thesauri and other external concept-oriented information that can help a searcher optimize precision and recall
  • External classifiers and reasoning engines can also contribute key pairs for a given chunk of content
  • multiple temporary FASTpage indexes can be used during content analysis and also during queries for quickly cross-matching and correlating interim results
  • a FASTpage index is well-suited to high-security applications for two primary reasons 1) designed-in, fine-grained access control, and 2) elimination of the need to retrieve the target records and/or objects to process queries and make security determinations MAC (mandatory access
  • a security barrier "token” can be inserted into the key just before a table identifier (e g , before each table identifier, or perhaps just one of them), and this would have the effect of "skipping over” any table that should be invisible to a particular query (based on the security context of the query itself)
  • Similar security barriers (which are tied to security policy) may be placed at other important locations within any key, and also within the area containing the target of any key, as well as being associated with all keys on a particular FASTpage index page and
  • Synchronous/Blocking vs. Asynchronous/Non-Blocking/Queued Interface The FASTpage processes can be implemented completely in software or hardware as an API (application programming interface) comprising synchronous (blocking) function calls or system calls
  • API application programming interface
  • a software implementation would comprise a set of asynchronous (non-blocking) message-oriented transactional services accessible via a queued messaging interface
  • a hardware implementation would comprise at least a non-blocking, transactional, packet-oriented queued interface such as might be implemented with PCI Express or HyperTransport (e g , with retrieval requests, posted writes)
  • FASTpage processes would be implemented in both software and hardware, due to their overall utility The idea is to standardize on FASTpage indexes and use them wherever they're applicable General purpose CPUs would use software implementations when appropriate (especially for temporary or transient indexes), but would also have access to hardware- accelerated implementations
  • the hardware implementations 1 would be a shared resource, accessible to multiple processors, and would largely be responsible for all persistent data
  • Leaf nodes require (vv/2) bytes in the normal case (1 byte for a follow-on internal reference and up to ((w/2)-1) bytes per short external reference, and up to (w-1) bytes per long external reference
  • the security ' banner is a special coded token that the FASTpage index can discern from the otherwise expected bytes in an index key sequence
  • security barriers are enabled at a particular level of granularity, there is a 1 -bit overhead to flag the security barrier at that level
  • a table-level barrier will incur an overhead of one bit per table, even if a particular table has no barrier Barriers are available for at least the following levels database, table, field or column, data type, vocabulary code, and target (e g , data or external reference)
  • multiple FASTpage processes would be instantiated within each PUMP (Parallel Universal Memory Processor) device (described elsewhere)
  • the PUMP device which would initially be implemented with reconfigurable logic (e g , FPGA) or "Structured ASIC "
  • Leaf nodes also require 4 bytes in the normal case (1 byte for a follow-on internal reference and up to 3 bytes of external reference), or 8 bytes if more addressing capacity is required (1 byte for a follow-on internal reference and up to 7 bytes of external reference)
  • the page-relative internal references are extended to w bits each (w>8), in order to allow for more nodes per page, and larger page sizes
  • the page-relative internal references are reduced to w bits each (w ⁇ 8), in order to allow for fewer nodes per page, and smaller page sizes
  • the page-relative internal references are specified on a per-page basis to w bits each, in order to flexibly and dynamically determine the nodes per page, and the page size, for a specific application scenario
  • each FASTpage index page has an associated offset (somewhat like a base address) that can be added to any external references to extend them
  • an LEB128-l ⁇ ke code would be used to identify tables, fields or columns, etc , within an index key, where one bit of an 8-b ⁇ t byte is used as a "stop" bit for variable-length values, with the consequence that only 7 bits per byte remain for data, yielding 2 7 or 128 possible values (hence the name)
  • the LEB128-l ⁇ ke code would be replaced with a similar but modified code where the first byte is special, by virtue of having a bit dedicated to flag the presence of a security barrier, and with the rest of the bytes, if any, being LEB128-l ⁇ ke, as before With the extra flag bit dedicated to the security barrier, the first byte can now take on only 2 6 or 64 possible values
  • the security barrier flag is NOT SET, it means that the byte sequence is NOT a security barrier, and is therefore processed normally
  • the security barrier flag IS SET it means that the first byte (
  • a spreadsheet is used to depict how a series of words is in inserted into a FASTpage index
  • a label indicates the word whose insertion is depicted, and the spreadsheet snippet depicts the content of the index
  • Load-balancing and other resource-sharing information is shared as typed block data in standard heartbeat messages
  • the information content and frequency of distribution varies according to a hierarchy of "granularity" that reflects the degree of locality most affected by the information
  • SHADOWS recognizes several hierarchical granularities of locality that can be configured as required to appropriately represent resource distributions, and comprising at least the following notional levels
  • the Machine which is a SCRAM node that is described elsewhere
  • the Machine comprises a set of Quadrants, each of which comprises a set of Lobes and an optional set of Blades, where each Lobe (and optionally any Blade) comprises at least one MASTER and typically at least one SLAVE, and both MASTERS and SLAVEs are typically multi-core processors
  • Sharing of intra-locality-specific load-balancing information occurs within each hierarchical granularity, and sharing of summarized load-balancing information occurs by pushing it to the next less-fine-grained level [Note In a preferred embodiment, this information sharing is implemented with secure multicasting wherever and whenever such multicasting is feasible, for efficiency, and with secure "simulated" multicasting otherwise ]
  • information sharing is more fine-grained ( ⁇ e , more detail and shared at a higher frequency) than across the Site containing the Machine, or the Neighborhood containing the Site, etc Similarly, within a "Neighborhood,” information sharing is more fine-grained than across the Community containing the Neighborhood, or the Region containing the Community, etc
  • Every Machine shares load and resource information with its Site at a relatively high frequency (compared to the summarization of the Site's information)
  • every Site shares load and resource information with its Neighborhood at a relatively high frequency, compared to the summarization of the Neighborhood's information, and so on
  • load-balancing information is fresher within a Machine than within a Site or Neighborhood, but fresher within a Neighborhood than within a Community, and so on
  • the information roll-up technique described here is not limited to load-balancing information, but is generally applicable to other information related to resource-sharing, and is especially useful for quantified classification of resource availability ( ⁇ e , relative capacity available rather than relative current load)
  • Resource information can also be much finer-grained than the available of a general resource such "computing capacity" -- it may extend, for example, to the capacity for handling a very specific task, or to the level of energy production, fuel reserves, network bandwidth, etc
  • the resulting information is particularly actionable when used in conjunction with "Think Globally, Act Locally” decision-making processes such as those used by the MASTER (described elsewhere) to determine its immediate propensity to offload tasks through delegation vs handling them locally vs volunteering to take on even more tasks
  • each Machine On a predetermined basis (in a preferred embodiment, round-robin turn-taking is used, determined by assigned time slot), each Machine also summarizes its Site's load info and shares it with both its Site ( ⁇ e , with all the peer Machines for whom it is summarizing information) and with its Neighborhood ( ⁇ e , with all the Sites that are peers of its Site) Each Machine takes a turn periodically, in order to amortize the overhead across the Site's multiple Machines) The summary includes a list (expressed or implied) of the Site's Machines and their Machine indices, ranked by quartile (in a preferred embodiment, quartile is used, but other classifications schemes are usable)
  • a multiplicity of peers at each level is responsible for each roll-up operation
  • Each such peer uses the same information basis to independently create a roll-up dataset (which should be identical to those created by the other n-1 participating peers)
  • Each of the n peers share only one slice, which means that the threshold value k (which may vary with context) determines how many correct slices have to be received to reconstruct the dataset
  • This technique (which, in a preferred embodiment, is also used in other contexts) contributes to Byzantine fault-tolerance, since up to n-k faulty contributors can be ignored (however, the SELF and BOSS subsystems take note of such failures)
  • An "untrusted node” such as an end-user PC or non-Scrutiny server (also referred to as the "subject machine") can be configured with SHADOWS software processes that enable its participation in the SHADOWS supercomputing infrastructure
  • Any PC or server machine, regardless of its "PC” or “server” label can serve as both a SHADOWS client (for one or more end-users) or as a SHADOWS server
  • SHADOWS agents agent processes
  • non-SHADOWS software e g , user applications
  • the user-local RUSHrouter further implements one or more MARSHAL roles that communicate with their assigned MARSHAL teams (more on this later) via one or more wide-area networks (WANs)
  • WANs wide-area networks
  • Each MARSHAL may also communicate with other "nearby" MARSHALS as specifically instructed by its MARSHAL team
  • the MARSHAL team communicates amongst itself and with other MARSHAL teams as necessary and permitted, in order to reach the WAN-facing
  • the SHADOWS infrastructure consists of a widely distributed cloud of dedicated "back-end” supercomputing and storage nodes (a discussion of which is beyond the scope of this brief), augmented by a collection of user-supplied computing and storage resources (“untrusted nodes")
  • the back-end supercomputing and storage nodes provide the resources necessary to achieve a basic level of service under the terms of a basic service level agreement (SLA)
  • a user can extend this basic service via "Quid Pro Quo" SLA, which provides a means to leverage the actual capacity of their local resources For example, a user typically consumes much less than 10% of the available computing capacity over a 24-hour period, leaving more than 90% unused, and therefore - wasted Under the terms of the Quid Pro Quo SLA, and with the support of the SHADOWS infrastructure, a user can not only take advantage of nearly 100% of the available capacity, but can do so when it is needed most
  • the combination of a Quid Pro Quo SLA and the SHADOWS infrastructure essentially allows a user to "bank" the unused or unneeded resources and recall them on demand
  • foreground processing refers to any SHADOWS processing that is performed locally to satisfy the immediate request(s) of a bona fide user
  • Background processing refers to any SHADOWS processing that is performed locally to satisfy the Scrutiny SHADOWS Quid Pro Quo SLA) associated with the subject machine
  • Communications processing refers to any local processing and communications required to satisfy the combined communications needs of foreground processing or background processing
  • End-user software applications can integrate with the SHADOWS infrastructure through a SHADOWS DELEGATE, which is essentially an application-specific proxy, or gateway, that provides the necessary interface
  • SHADOWS DELEGATE which is essentially an application-specific proxy, or gateway, that provides the necessary interface
  • the user-facing side of the DELEGATE implements one or more APIs (application programming interfaces) and/or protocols needed by the user's software applications (and to be provided by SHADOWS) Examples would include various file systems, version control systems, database managements systems, directory systems, email systems, instant messaging, VoIP, etc
  • each DELEGATE provides a single, minimalist user-facing API that implements a particular API and/or protocol
  • one DELEGATE might implement the MAPI email/messaging protocol, while another DELEGATE implements the IMAP email/messaging protocol If IMAP isn't needed, then the IMAP DELEGATE is not installed
  • the SHADOWS- facing side of the DELEGATE implements the SHADOWS RUSH protocol
  • the DELEGATE is essentially a protocol and data translation process interposed between a user's software application and the SHADOWS infrastructure
  • the user's software application depends on some functionality external to itself (which is "normally" provided by another local or remote software application or service), and it is the role of the appropriate DELEGATE to emulate that functionality
  • a DELEGATE need not be installed as a service on the subject machine -- it may run on-demand as an application For example, a software developer could locally run a "compiler" from a make file
  • the actual compiler can be replaced with a DELEGATE that is responsible for the compilation, but which securely communicates with the SHADOWS infrastructure to do the actual work and return exactly the same results that would have been returned by the local compiler, but with improvements in one or more dimensions (e g , less elapsed time, deeper analysis, etc)
  • FEC is also used to return actual results of a particular operation
  • FEC can be used (as instructed) to encode any results, which may be partially cached and partially stored, and — more importantly - only partially returned to the SHADOWS infrastructure (or forwarded elsewhere as requested)
  • each SERVANT sends only a fraction of the result, taking advantage of the almost-always-smaller-uplink-capacity associated with the subject machine, while taking also taking advantage of the almost-always-larger-downlink-capacity of the target recipient by receiving an aggregation of inputs from diverse SERVANTS
  • the ability to capitalize on asymmetric uplink/downlink capacities is particularly beneficial for communicating interim results among a large collection of SERVANTS, under the direct control of the SHADOWS infrastructure
  • Corrupt and/or unresponsive SERVANTS are easily detected and worked around by the combination of FEC and encryption, among other techniques SERVANTS are not provided with communication capability except that required to communicate with their local (interior) MARSHAL
  • One of the operations of a SERVANT process is to launch a subordinate SERVANT process and register it with its MARSHAL (the subordinate SERVANT must also register directly)
  • the subordinate SERVANT process "executable" is first created from cached and/or stored objects Oust like any other result), under the auspices of the SHADOWS infrastructure Once the synthesis is approved, it is then launched as a separate request (which also requires concurrence of the local RUSHrouter, which cannot withhold concurrence unless something isn't quite right)
  • the local MARSHAL may kill any local SERVANT process, and may also causes its own virtual machine to be restarted (by crashing itself, if need be) SERVANT processes may thus be created and deleted on the fly, but each is completely expendable
  • every aspect of a user's machine is expendable Background processing is always subordinate to foreground processing, and thus the SERVANT may relinquish control whenever there is foreground processing to do (which could be due to any DELEGATE,
  • RUSH When servicing users, RUSH is the only visible protocol after the user's application programming interface (API) up to and including the SHADOWS FLAMERouter On the user side, the API is terminated locally by the associated resident DELEGATE process, which serves as a stateful protocol translator (API to RUSH) and application gateway
  • the FLAMERouter on the server side has an embedded MASTER, and can fully terminate the RUSH protocol
  • the FLAMERouter's MASTER can communicate with other internal MASTERS using RECAP
  • the UNCAP protocol is tunneled through RUSH, such that RUSH is still the only visible protocol
  • the SHADOWS SERVANT natively uses the RUSH protocol, so no user API is involved
  • UNCAP which is tunneled over RUSH, is used to direct the SERVANT RECAP traffic occurs only between the MASTER and the FLAMERouter (which has an embedded MASTER), but does not propagate to RUSH
  • RUSH models all bidirectional links as two unidirectional links, because the presence of one direction (e g , reception) does not imply the correct functioning of the other (e g , transmission) This same principle holds for radio frequency and optical traffic used for RUSH communications
  • Any link that is defined in terms of one or more usage and/or capacity thresholds (e g , bursting limits, capacity windows, etc ) is modeled in SHADOWS as a set of related sub-links, each defined by its own set of thresholds Given a set of sub-links, the links must be labeled in ascending order of sequential use (e g , A1 is used before A2, etc )
  • the SHADOWS infrastructure dynamically characterizes the paths (especially the major ingress/egress points, and intermediate SHADOWS nodes) between its various sites, and creates a plan for each site that allocates one-way traffic (to each possible destination site) along its outbound links, in such a way as to globally optimize the use of each link
  • the allocation of traffic is dynamic in that it may be re-planned on a periodic or event-driven basis, but is almost always event-driven (expiration of a plan is considered an event, as is any major perturbation in SHADOWS network status)
  • Route planning generally strives to achieve several potentially conflicting goals Conflicts among goals may not occur (or may be of no consequence) as long as link utilization stays below some "bottleneck" threshold
  • One of the desirable side effects of route planning is to identify and monitor actual or potential inter-site bottlenecks in the SHADOWS network, and to recommend (or ideally, to execute) provisioning changes
  • the statistical path (and link) parameters comprise at least some combination of the following
  • Energy usage is a key consideration for survivability, especially for systems that must continue to operate when the utility power grid is down, and for systems that routinely operate off-grid
  • a given site may load-shed to an extreme degree, such as getting rid of all processing and storage tasks, leaving only communications (in order to avoid a potential network partition), there can still be a critical shortage of energy, especially if prolonged periods must be endured
  • SHADOWS and specifically, the RECAP and RUSH protocols
  • the RUSH protocol considers energy usage in its routing algorithms, such that during certain resource scenarios, routing occurs in such a way as to maximize network coverage while minimizing energy usage
  • the overriding goal is to
  • each SHADOWS site has multiple external communication channels that connect it directly or indirectly to all other SHADOWS sites via a diverse variety of networks (VLAN, WAN, WLAN, etc ) over a combination of optical, wired, terrestrial wireless, and satellite wireless links Normally, only a few of these links (as few as one) is needed to prevent a network partition In a strictly site-local energy crisis, survival of the site (preventing a network partition) may be a simple matter of choosing the link with the lowest power requirement However, in the case of a site whose presence is pivotal in keeping several other sites connected, the problem is more complex Such a problem could occur, for example, if the local site is the only survivor capable of connecting two parts of a mesh, and both of the parts depend on the local site for WAN connectivity In this case, the system uses the minimum number of links needed (at appropriate power levels) to safely prevent a network partition, and the determination of a minimalist configuration requires understanding of the present nearby network topology, since lower power
  • the power requirements for data transmission are determined and normalized into a cost per 512-byte packet * A table of representative costs is depicted below
  • a SHADOWS inter-node messaging plan is a means for globally optimal use of locally available network communications links, while dynamically adapting to a frequently changing network state
  • Inter-node messaging plans are created and maintained in quasi-real-time, and are sufficiently event-driven to account for network perturbations and state changes However, “event-driven” doesn't mean “message-
  • Each messaging plan accommodates the both the recommended outbound link capacity and maximum outbound link capacity of a particular node
  • actionable messages occur to request automated workload-shifting and/or provisioning requests
  • a messaging plan for one-way A-to-B communications is simply, for each QoS priority, a list of next-hop links (or sub-links) that are to be used, along with the percentage of data that is to be sent via each link
  • Example The messaging plan data structure for point A (the origin of a one-way communication from point A to point B) is conceptually similar to the following

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

La présente invention concerne une infrastructure de superinformatique répartie, fonctionnellement efficace, abordable, hautement fiable, durable et disponible pour traiter, partager et protéger à la fois des informations structurées et des informations non structurées. Un objet principal de l'infrastructure SHADOWS consiste à établir une plateforme extrêmement durable, partagée sans maintenance, pour un calcul extrêmement intensif (c'est à dire de superinformatique), la haute performance étant définie à la fois en termes de débit total, mais également en termes de temps de latence très faible (bien que chaque problème ou client ne nécessite pas forcément une très faible latence), tout en réalisant des niveaux sans précédent d'accessibilité. Selon sa forme la plus simple, l'idée est d'utiliser des 'équipes' réparties de nœuds dans un réseau à autorétablissement comme base de gestion et de coordination à la fois du travail réalisé et des ressources disponibles pour effectuer le travail. Le concept SHADOWS avec les 'équipes' est réactif pour sa capacité 'd'autorétablissement' et 'd'adaptation' de ses ressources réparties de manière organique. En outre, les 'équipes' mêmes sont au cœur de la prise de décisions, de traitement et de stockage de l'infrastructure SHADOWS. Tout ce qui est important est géré sous les auspices et la responsabilité d'une équipe.
EP08746717A 2007-04-23 2008-04-23 346infrastructure informatique Withdrawn EP2145362A4 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US91350207P 2007-04-23 2007-04-23
PCT/US2008/061346 WO2008131446A2 (fr) 2007-04-23 2008-04-23 Infrastructure informatique

Publications (2)

Publication Number Publication Date
EP2145362A2 true EP2145362A2 (fr) 2010-01-20
EP2145362A4 EP2145362A4 (fr) 2012-01-25

Family

ID=39876195

Family Applications (1)

Application Number Title Priority Date Filing Date
EP08746717A Withdrawn EP2145362A4 (fr) 2007-04-23 2008-04-23 346infrastructure informatique

Country Status (3)

Country Link
EP (1) EP2145362A4 (fr)
CA (1) CA2718136A1 (fr)
WO (1) WO2008131446A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10721680B2 (en) 2016-04-21 2020-07-21 At&T Intellectual Property I, L.P. Method and apparatus for providing a virtual network function in a network

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8291036B2 (en) 2009-03-16 2012-10-16 Microsoft Corporation Datacenter synchronization
US9552478B2 (en) 2010-05-18 2017-01-24 AO Kaspersky Lab Team security for portable information devices
US20130086267A1 (en) * 2010-09-24 2013-04-04 Bae Systems Plc Admission control in a self aware network
RU2494453C2 (ru) 2011-11-24 2013-09-27 Закрытое акционерное общество "Лаборатория Касперского" Способ распределенного выполнения задач компьютерной безопасности
EP2807347A2 (fr) * 2011-12-30 2014-12-03 Scrutiny, INC. Appareil dénommé « frame » (frame pour forced recuperation, aggregation and movement of exergy) permettant une récupération forcée, une agrégation et une circulation de l'exergie
RU2510074C2 (ru) 2012-02-24 2014-03-20 Закрытое акционерное общество "Лаборатория Касперского" Система и способ проверки исполняемого кода перед его выполнением
US9817876B2 (en) * 2015-06-29 2017-11-14 Planisware SAS Enhanced mechanisms for managing multidimensional data
CN107040411B (zh) * 2017-03-31 2020-11-13 台州市吉吉知识产权运营有限公司 一种基于事件驱动模型的智能网关管理方法及系统
US10706038B2 (en) * 2017-07-27 2020-07-07 Cisco Technology, Inc. System and method for state object data store
US10779342B2 (en) * 2017-11-27 2020-09-15 Cypress Semiconductor Corporation Load balance for dual interface automotive wi-fi controllers for P2P devices
US10802932B2 (en) 2017-12-04 2020-10-13 Nxp Usa, Inc. Data processing system having lockstep operation
CN108564249B (zh) * 2018-03-06 2022-02-15 华南理工大学 计及分布式光伏随机性的配电网置信削峰效益评估方法
US11657297B2 (en) * 2018-04-30 2023-05-23 Bank Of America Corporation Computer architecture for communications in a cloud-based correlithm object processing system
CN111726210B (zh) * 2019-03-21 2022-02-08 大唐移动通信设备有限公司 信息获取、发送方法、网络设备、终端及集中式网络配置
US10666427B1 (en) * 2019-06-11 2020-05-26 Integrity Security Services Llc Device update transmission using a bloom filter
EP3800912B1 (fr) * 2019-10-03 2023-07-19 Accenture Global Solutions Limited Informatique en zone périphérique protégeant la vie privée pour autorisation d'opérations sécurisées
CN110868076B (zh) * 2019-11-15 2021-08-10 中国舰船研究设计中心 一种用于船舶区域配电的dc-dc斩波装置
CN113852479B (zh) * 2020-06-28 2022-12-02 中移(成都)信息通信科技有限公司 一种安全网络构建方法、装置、设备和计算机存储介质
CN115277031B (zh) * 2021-04-13 2024-05-10 华为技术有限公司 一种数据处理的方法和装置
CN114893715B (zh) * 2022-04-02 2023-11-21 安徽宇航派蒙健康科技股份有限公司 加热控制方法及其装置、系统、计算机设备和存储介质
CN118508412B (zh) * 2024-05-06 2024-12-24 南京师范大学 一种分布式电源聚合商自组织共识、激励与选择方法
CN120934202B (zh) * 2025-10-14 2025-12-16 国网安徽省电力有限公司宣城供电公司 智能变电站二次设备在线重构控制方法及系统

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7562143B2 (en) * 2004-01-13 2009-07-14 International Business Machines Corporation Managing escalating resource needs within a grid environment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10721680B2 (en) 2016-04-21 2020-07-21 At&T Intellectual Property I, L.P. Method and apparatus for providing a virtual network function in a network

Also Published As

Publication number Publication date
EP2145362A4 (fr) 2012-01-25
WO2008131446A3 (fr) 2010-02-11
WO2008131446A2 (fr) 2008-10-30
CA2718136A1 (fr) 2008-10-30

Similar Documents

Publication Publication Date Title
US8706915B2 (en) Computing infrastructure
WO2008131446A2 (fr) Infrastructure informatique
Wang et al. Bft in blockchains: From protocols to use cases
US11171774B2 (en) System for synchronizing a cryptographic key state through a blockchain
CN104158879B (zh) 一种分布式数据中心云管理平台架构系统及方法
Hu et al. MNOS: a mimic network operating system for software defined networks
Li et al. A convergence of key‐value storage systems from clouds to supercomputers
Mahony et al. A systematic review of blockchain hardware acceleration architectures
Doku et al. Fusion of named data networking and blockchain for resilient internet-of-battlefield-things
Xu et al. Energy‐efficient big data storage and retrieval for wireless sensor networks with nonuniform node distribution
Ramya et al. SecDedoop: secure deduplication with access control of big data in the HDFS/hadoop environment
Xie et al. SAGIN-ID: A rapid intrusion detection method for space-air-ground integrated network based on smart contracts
Lai et al. Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems
Zhang et al. A Markov prediction-based privacy protection scheme for continuous query
Wu et al. Blockchain-based trusted avionics authentication for secure ground-to-air communication
Xuan et al. Energy Efficiency Opposition‐Based Learning and Brain Storm Optimization for VNF‐SC Deployment in IoT
Zhang et al. Concurrent regenerating codes
Alavizadeh et al. A distributed reliable collusion‐free algorithm for selecting multiple coordinators in IOTA using fog computing
Vance et al. Towards optimal incentive-driven verification in social sensing based smart city applications
Wang et al. ABFT: high-performance asynchronous byzantine fault-tolerant consensus algorithm for electricity data metrology
He et al. The Design and Implementation of Random Linear Network Coding Based Distributed Storage System in Dynamic Networks
Arafin Computation-in-Memory Accelerators for Secure Graph Database: Opportunities and Challenges
Fang et al. PlanetServe: A Decentralized, Scalable, and Privacy-Preserving Overlay for Democratizing Large Language Model Serving
Chavan et al. Cloud Security Solution: Fragmentation and Replication
Sinha Integration of Distributed Consensus Protocols for Scalable Sharded Blockchain Systems

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA MK RS

R17D Deferred search report published (corrected)

Effective date: 20100211

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 11/00 20060101AFI20100413BHEP

17P Request for examination filed

Effective date: 20100809

RBV Designated contracting states (corrected)

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MT NL NO PL PT RO SE SI SK TR

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20111227

RIC1 Information provided on ipc code assigned before grant

Ipc: H04L 29/06 20060101ALI20111220BHEP

Ipc: G06F 9/50 20060101AFI20111220BHEP

Ipc: G06F 15/16 20060101ALI20111220BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20171103