Tuesday 28 April 2009

Latency hiding patterns in CPUs

Latency is a major factor in CPU design. Most general-purpose CPUs execute a sequence of instructions with the logical convention that one instruction completes before the next starts. However, memory access latency, and even the latency of pure computation, bottleneck the achievable processing throughput. To circumvent this, a number of techniques are used to parallelise computation and communication.

Conductors have capacitance and resistance and therefore take time to switch between voltage levels. This manifests as a propagation delay proportional to the length of the conductor. As CPU clock speeds increase, the time between rising clock edges reduces, and the maximum length of conductor that can be charged in that time shrinks. CPUs and other data-path designs must ensure that no signal propagation path is near the tolerance for propagation delay at the designed maximum clock speed.

In practice this means that there is a trade-off between clock frequency and circuit size. Large circuits can accomplish more per clock cycle, but cannot be clocked as high as smaller circuits. To continue increasing clock rates with a fixed feature size, a circuit must be broken up into smaller and smaller sub-circuits with synchronous boundaries. Even this is not enough in some cases as the clock signals themselves are skewed by the propagation latencies across the chip. In some cases this is solved by having islands of synchronous logic which communicate asynchronously as no synchronous clock can span the whole circuit.

So even within a chip, there can be a large, and growing difference between local and remote communication latency. Ignoring this simplifies designs but forces lowest common denominator performance. This trade-off between complexity and latency awareness and tolerance is repeated at many layers of the system stack.

Even within a synchronous logic island on a chip, there is latency to be hidden. Instruction fetch, decode and execute take time, and pipelining is used in CPU designs to maximise throughput despite irreducible latency. The various steps of instruction decode, register selection, ALU activation etc. are split out with latches between them so that the maximum propagation delay in any stage of the pipeline is minimised. For example, a datapath with a 400-gate worst-case propagation delay could theoretically be split into 4 stages, each with a 100-gate worst-case propagation delay, allowing a 4 times faster clock speed.

With pipelining in a CPU, the first instruction to be processed takes at least the same amount of time to complete, but after that a new instruction can be completed on every clock cycle - theoretically allowing 4 times as many instructions to be completed per unit time in this example. In practice, memory latency and bandwidth limits and dependencies between instructions mean that the pipeline can stall, or bubbles can form, reducing the achieved throughput. However, these problems can often be alleviated, and pipelines have been very successful in CPU design, with instruction fetch-execute-retire pipelines comprising as many as 40 or more stages, with multiple instances of stages in super-scalar designs.

So pipelining in general allows repetitive serial task processing to be parallelised with potential benefits :
  • Increased parallelism between instances of serial tasks
    Allowing greater throughput
  • Benefits of task-specificity (instruction and data caching benefits)
    Potentially improving efficiency

and the following caveats :
  • Dependencies between tasks must be understood and honoured
    Sometimes this reduces or eliminates parallelism
  • Pipeline stalls or bubbles due to job starvation or dependencies will reduce throughput
    The pipeline must be kept fed.
  • Achievable throughput is bottlenecked by the slowest stage
    As with a production line, the slowest worker sets the pace, though slow workers can be duplicated to improve throughput.
  • Individual task latency can increase

Pipelining as a general technique can apply to any system that processes tasks of a regular form which can be split into stages. The SEDA system from Harvard is a general purpose server framework for implementing server processes as pipelined stages. The software setting allows more flexible and dynamic tradeoffs to be made between pipeline lengths and widths. It also offers a flexible way to separate asynchronous and synchronous steps.

Other interesting latency hiding techniques seen in CPUs are mostly associated with hiding memory access latency.

Latency hiding patterns

I want to write an entry about latency hiding patterns. Unfortunately my previous attempts became too long and boring even for me. This time I'm going to try something smaller and less rich to get the blog-flow more regular.

I'm interested in latency hiding patterns as I repeatedly see them being implemented at all levels of systems, from the silicon to the top of the application stack. Often latency hiding techniques are what deliver better-than-Moore's-law performance improvements, and they can have a great impact on usability as well as throughput and system efficiency.

I would describe latency hiding techniques as things that :
  • Maximise the value of communication
  • Maximise the concurrency of communication and computation
Communication value is maximised by avoiding communication, and when it cannot be avoided, minimising the overheads.
Communication and computation concurrency is maximised by maximising the independence of communication and computation. This requires understanding the dependencies between computation and communication.

Right, all very abstract. Let's hope some examples are more interesting.

Thursday 16 April 2009

Protel II

In the first post I described the basic Protel language. In the mid 1990s it was extended with object oriented capabilities. This was done in a number of phases and was tied into the development of a project at Nortel called Generic Services Framework (GSF). This was an object oriented reimplementation of 'call processing' on the DMS with wide scope and a huge development team. Nortel even created an 'Object Center' somewhere in Canada with a helpline that confused designers could call to get OO Protel advice. I suspect that today it would be a 'Center of Object Excellence'. I heard that the GSF project was not an unqualified success, but it did drive the evolution of Protel-2 which was used later in other products. In retrospect it seems that GSF and Protel-2 were more motivated by the Objects-with-everything zeitgeist than any particularly compelling benefits.

Protel-2 supports single inheritance with a common root class called $OBJECT. Methods can be explicitly declared to be overridable in base classes, similarly to C++'s virtual keyword. It does not support operator overloading, or overloading method names with different signatures. It supports fully abstract classes.
Like Eiffel, Protel-2 allows parameterless methods which return a value to be called without parentheses. This allows data members to be refactored to be read directly or via an accessor method without changing the callers. I'm not sure how valuable this is in practice, especially as it does not affect assignment. Perhaps some Eiffel practitioners have experience of finding this useful? Protel-2 methods can be declared to be read-only with respect to the object instances they operate on, in a similar way to specifying const-ness in C++.

The GENERIC keyword in Protel-2 allows class definitions to be parameterised by type. This allows the creation of type-safe generic collection classes and data structures. The type parameterisation is similar to the Generics mechanism in Java, in that it is effectively a compile-time-only mechanism. The compiler generates a single underlying class implementation and checks type-correctness at compile time. A side-effect is that all access to the parameterised class must be made via a pointer. This fits well with the requirement for online load-replacing implementations etc., but it offers much reduced power compared to C++'s code-generation-style templating mechanism.

As with most C++ implementations, Protel-2 objects contain a vtbl pointer in the first word of their data followed by data members of superclasses and the class instance. On XA-Core, this sometimes presented a problem in that the normal transactional memory ownership mechanism could create a bottleneck when used for the vtbl ptr, but often data in the rest of the class should use the transactional memory mechanism. To deal with this, special libraries were created to allow the object header to be stored in WRITEBLOCKING memory and the rest of the object to be stored in BLOCKING memory.

That's scraping the bottom of the barrel on Protel-2 information in my head. I thought there was more there (or maybe just something more interesting). At least I've written it down.

Sunday 12 April 2009

Protel I

Protel is the PRocedure Oriented Type Enforcing Language used for most DMS software. Currently there's not much information available about it online. An early paper describing the language is referenced here, but is hidden behind a subscription-only portal. Wikipedia offers a very minimal definition of the term but little else. I think there's some good stuff that should be recorded so I'll attempt to describe what I found interesting.

So what is Protel like? I'm told that it's similar to Modula-2 and even Modula-3, and it's true that it shares explicit BEGIN / END block syntax with Pascal, and that all code is divided into modules.
One of the most basic differences between Protel and these languages is its use of the composite symbol '->' or 'Gozinta' (Goes into) for assignment. This eliminates any confusion between assignment and equality testing. This 'removal of ambiguity' is a key pattern in the design of Protel. Similar manifestations of the pattern are the rules :
  • No operator precedence rules
    Unlike C, Protel assigns no built-in relative precedence to operators. All expressions are evaluated left-to-right, and the programmer must use brackets explicitly to specify any other order - so 2 + 3 * 4 evaluates to 20, and 2 + (3 * 4) is needed to get 14.
  • No preprocessor
    There is no standard preprocessor and therefore no macro language. The compiler supports some limited compile-time expression evaluation, including sizeof() for types etc. This avoids context-specific semantics for source code and non-visible code expansion.
  • No support for pointer arithmetic
    Where a pointer is to be treated as referring to some element of an array, the descriptor mechanism, which includes bounds checking, is to be used rather than pointer arithmetic.
  • Control structure specific end-of-block keywords
    Rather than a single END keyword, Protel employs ENDBLOCK, ENDPROC, ENDWHILE, etc. to aid code readability.
  • No 'keystroke reduction mechanisms'
    C's ++, += etc.
These rules can make life harder when writing code and increase verbosity, but can aid readability and reduce the amount of non-local knowledge needed to understand code.

Protel modules contain one or more source code files which can export definitions for use by other modules. Different source files within a module can be arranged into trees which control compile order within a module. Modules often have multiple levels of interface source files - most public and general APIs in the top level, more private and specific APIs in the lower levels. Access to definitions in each interface file can be controlled independently if required.

Basic Protel supports built-in and user-defined types, pointers, arrays, an 'array slice' descriptor mechanism, and a novel extensible fixed-size type called an Area, as well as some type-reflection capabilities.

A Protel DESCriptor is used to refer to a range of elements in an array. It is used in the same way as an array - with a subscript, as an lvalue or rvalue in an expression (though the usual meaning of those terms is confused by the Gozinta operator!). The compiler is aware of the type of the slice being DESCribed, and by inference the size of the elements. In the storage allocated to the DESC itself, it stores a pointer to the zeroth element and an upperbound in terms of elements. In this way it can provide bounds checking on accesses through the DESC. When an out-of-bounds exception is hit, the actual upperbound and the supplied subscript are available in the exception report, often allowing debugging straight from the trace. The array slice abstraction can be a nice way to deal with zero-copy requirements in a protocol stack.

Protel offers the BIND keyword which can be used to define a local-scope alias to some variable instance. Its use is encouraged to reduce keystrokes, and it is also useful for indicating to the reader and the compiler that some dereferencing operations need only be performed once even though the referenced value is used multiple times. A side effect of its use is that it reduces the tension between short, easily typed names and long, descriptive ones, allowing long descriptive names to be shortened in use when necessary. Of course this can add to ambiguity and confusion.

Protel supports typed procedure pointers with an explicit type classifier - PROCVAR. I suspect that this type classifier exists to improve code readability rather than for any syntactic necessity as PTR to PROC can be used similarly. PROCVARs are used heavily in SOS code to allow applications to override behaviours and extend OS behaviour. SOS has unique terms for the use of procedure pointers :
  • GATE
    SOS name for a Procedure Variable that is expected to be set by some other module. It is a 'gate' to some other implementation. Usually gates are defined in lower-level modules.
  • TARGET
    SOS name for a procedure implementation referenced from a GATE. This is the 'target' of a call to a 'gate'. Usually targets are defined in higher-level modules.
  • ASPECT
    SOS name for a structure containing a number of procedure variables. This can be thought of as an 'interface' in the Java sense - a set of method signatures. Often a lower-level module would provide an API for a higher-level API to add some functionality including some data and perhaps an 'aspect' of procedure variables.

As well as supporting standard composite structures similar to a C struct, Protel supports the concept of an Area which can be used to implement a form of type inheritance. An AREA is similar to a struct, but contains as part of its definition a storage size in bits, as well as zero or more members. At compile time it is checked that the declared members' types fit within the size given, and instances will be created with the size given. Other modules can declare further areas which REFINE this area, adding definitions to the original AREA. The compiler will check that the complete set of member definitions continues to fit in the bit size of the original AREA. This mechanism can be used to create trees of hierarchically related data types, which is very useful for code modularity and extensibility, as well as for more basic optionality similar to a C union. Putting procedure pointers into the Area gives a rather rough extensible virtual-method mechanism in-language. However, most DMS software designers used AREA refinements for hierarchically varying data rather than allowing control flow overrides.

Protel offers some reflection capability via the TYPEDESC operator. It is applied to a type and returns a structure which can be used to determine type names, bit offsets etc. SOS supports an online extensible data dictionary which uses TYPEDESC to track types and their relationships. It is version aware and is used by Table Control and for data reformatting between software releases.

Combining PROCVARs and REFINEd AREAs allowed extensible systems to be built fairly easily without OO techniques built into the language. However, the explicit nature of the PROCVARs, the requirement to define up-front bitsizes for refinable areas and the general micromanagement required to define, initialise and use 'object' hierarchies made from these components discouraged most designers from using them in this way. Providing tools at this atomic level encouraged each designer to try their own combination of hard coding, ProcVars, extensible areas, pointers-to-extension-structs, pointers-to-data-structs-with-procvars etc. More manual visualisation effort was required to grasp these mechanisms than would be required for an equivalent language with OO extensions.

I think PROTEL was fairly state-of-the-art when it was introduced for DMS software. Especially considering its planned use for telecoms switching equipment, it is a very general purpose language, not visibly oriented towards telecoms. It has a fairly clean split between language features and runtime libraries. Perhaps if it had been more widely known beyond the confines of Nortel/BNR then it could have enjoyed some life of its own? I believe it is still being actively used - these days SOS images run in virtualised environments with code written by outsourced employees paid by a broken company, but I imagine there must still be patches getting written. However the outlook looks bleak. With Nortel on the rocks and apparently no interesting information about Protel available on the web (except this blog of course :) ), it looks like it could vanish after 30 years.

In the mid-to-late nineties Nortel added object orientation to Protel; I'll talk briefly about that in a future post.