COM & CORBA


Adapting Automation Arrays to the Standard vector Interface

Andrei Alexandrescu

STL can be extended in remarkable ways. Sometimes you can even lie about what's inside a container, and to good advantage.


Introduction

One of the great benefits of distributed object systems like COM and CORBA derives from their use of standard binary protocols and formats. It is these binary standards that enable us to use multiple languages in large applications. We can use the most suitable language for each part of the system. A project I was recently involved in, for instance, sported a three-tiered architecture using VB (Visual Basic) for the GUI, MS SQL Server for database access, and C++ for number crunching, all using the COM standard for communication (a quite common scenario).

Efficient data transfer is often very important in such applications. In our case, we needed to pass large two-dimensional matrices between various parts of the system. We chose Microsoft's Automation array format (known as SAFEARRAY) because it is a standard data type that's automatically marshaled between COM objects (in addition to being easy to manipulate from within VB). This article describes a complete, type-safe C++ wrapper for the Automation array API. This wrapper implements the std::vector interface as described in the Standard C++ Library, yet fully preserves the binary format of the Automation structure.

By preserving Automation's binary format, the implementation avoids costly format conversions while hiding the intricate SAFEARRAY API under elegant std::vector semantics. This wrapper helped us boost our productivity significantly. We could typically implement a business rule with only half the lines of code we would have needed using the raw SAFEARRAY API. Moreover, the code became much easier to follow, and all without a loss in efficiency — in fact, a slight gain was observed.

About VARIANTs and SAFEARRAYs

SAFEARRAYs are based on an important COM data structure known as a VARIANT. The VARIANT structure and its associated API implement the semantics of a dynamically-typed object. In a dynamically typed object like VARIANT, type is not an intrinsic attribute (as it is in C++). Rather, type has an explicit representation, and can thus be changed at run time. The following snippet of Visual Basic demonstrates such a type change to a dynamically typed object:

Dim MyVar as Variant
MyVar = 3
MyVar = "Hello" ' change the type
                ' and the value
                ' of MyVar

A VARIANT can hold an object of one type at a time, chosen from a finite set of types. It is implemented as a structure known as a "tagged union," which is a union of all the types that can be held and an integer (the tag) that indicates which field of the union is valid.

An excerpt of VARIANT's implementation is given below:

enum VARENUM
{
   VT_EMPTY = 0, // empty VARIANT
   VT_NULL = 1,  // null
   VT_I2 = 2,    // two-byte integer
   VT_I4 = 3,    // four-byte integer
   VT_R4 = 4,    // four-byte floating-point type
   VT_R8 = 5,    // eight-byte floating point type

   ... other VARIANT type tags ...

   VT_ARRAY = 0x2000,

   ...

};

typedef struct  tagVARIANT
{
   unsigned short vt;
   WORD wReserved1;
   WORD wReserved2;
   WORD wReserved3;
   union
   {
      short iVal;
      long lVal;
      float fltVal;
      double dblVal;

      ... other VARIANT types ...

      SAFEARRAY *parray;

      ...

   };
} VARIANT;

Using such a structure is simple. Given a VARIANT, you look at its vt field (the tag). If the tag is VT_I4, for example, then the lVal field of the union is valid, and so forth. (In VB, all this is hidden in the runtime support. C++ programmers have to do this explicitly.)
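As a rough illustration of this tag-then-field protocol, here is a minimal C++ sketch. MiniVariant, MiniTag, and as_double are hypothetical stand-ins invented for the example (the real VARIANT comes from the Windows headers and has many more members), so read it as a sketch under those assumptions rather than as production code.

```cpp
#include <cassert>
#include <stdexcept>

// Simplified stand-ins for the Automation declarations (assumption: the
// real VARENUM and VARIANT are supplied by the Windows headers).
enum MiniTag : unsigned short { MT_I2 = 2, MT_I4 = 3, MT_R4 = 4, MT_R8 = 5 };

struct MiniVariant {
    unsigned short vt;   // the tag: says which union member is valid
    union {
        short  iVal;
        long   lVal;
        float  fltVal;
        double dblVal;
    };
};

// Read whichever member the tag designates and return it as a double.
inline double as_double(const MiniVariant& v) {
    switch (v.vt) {
        case MT_I2: return v.iVal;
        case MT_I4: return static_cast<double>(v.lVal);
        case MT_R4: return v.fltVal;
        case MT_R8: return v.dblVal;
        default:    throw std::invalid_argument("unsupported tag");
    }
}
```

Every consumer of a tagged union has to repeat a switch of this kind, which is exactly the bookkeeping that VB's runtime hides.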

There are also some flags defined in VARENUM. The one of interest here is VT_ARRAY, which enables a VARIANT to contain not only a single value, but also an array. It works like this: by convention, the tag (vt field) of a VARIANT containing an array is obtained by bitwise ORing VT_ARRAY (0x2000) with the tag of the type contained in the array. For instance, if the vt field of a VARIANT is VT_ARRAY | VT_I4 (0x2003), then the VARIANT contains an array of four-byte integers. Arrays of VARIANTs are also allowed (the corresponding tag is VT_ARRAY | VT_VARIANT). This last feature gives arrays the closure property: you can build structures of arbitrary complexity by using only arrays (e.g., arrays of VARIANTs which in turn are arrays, etc.).
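The tag arithmetic described above can be sketched in a few lines. The constants are local copies made for the example (on Windows they come from the Automation headers, where the type mask is named VT_TYPEMASK), and the helper names holds_array, element_tag, and array_of are hypothetical.

```cpp
#include <cassert>

// Local copies of the relevant VARENUM values (assumption: real code would
// use the constants from the Automation headers instead).
const unsigned short kVT_I4       = 3;
const unsigned short kVT_VARIANT  = 12;
const unsigned short kVT_ARRAY    = 0x2000;
const unsigned short kVT_TYPEMASK = 0x0FFF;

// True if the tag says the VARIANT holds an array.
inline bool holds_array(unsigned short vt) {
    return (vt & kVT_ARRAY) != 0;
}

// Strip the flag bits and return the tag of the element type.
inline unsigned short element_tag(unsigned short vt) {
    return static_cast<unsigned short>(vt & kVT_TYPEMASK);
}

// Build the tag for "array of elem".
inline unsigned short array_of(unsigned short elem) {
    return static_cast<unsigned short>(elem | kVT_ARRAY);
}
```

With these helpers, array_of(kVT_I4) yields 0x2003, the value quoted in the text for an array of four-byte integers.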

The array itself is stored in the parray member of the union, which is of type pointer to SAFEARRAY. The SAFEARRAY structure is shown in Figure 1.

SAFEARRAY is a variable-sized structure. rgsabound has a length equal to cDims and stores the number of elements and lower bound for each dimension. (I deal just with one-dimensional SAFEARRAYs in this article. Multi-dimensional matrices can be implemented as arrays of arrays.) The SAFEARRAY structure is considerably flexible — it can hold any number of dimensions and any element type — but at the cost that everything about the type is decided at run time. There are functions in the SAFEARRAY API that manipulate this structure (e.g., SafeArrayCreate, SafeArrayGetElement, SafeArraySetElement, and so on), but they are clumsy and they don't add anything in terms of type safety.

Another issue is locking. The SAFEARRAY API is a palimpsest from the Win16 days, when memory paging was not available. If you wanted to access a chunk of memory, you had to lock a handle to it, get a pointer, use it, and eventually release the lock. This requirement, although now obsolete for Win32 memory handles, remains valid for Automation arrays. The pvData member in the SAFEARRAY structure is valid only after you call SafeArrayLock, and every such call must be paired with a call to SafeArrayUnlock.
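One common way to keep lock and unlock calls paired is a small guard object. The sketch below is not taken from automation_vector; FakeArray, fake_lock, fake_unlock, and ArrayLock are hypothetical stand-ins so that the example compiles without the Windows headers (the real SafeArrayLock and SafeArrayUnlock take a SAFEARRAY pointer and return an HRESULT).

```cpp
#include <cassert>

// Stand-in for SAFEARRAY with only the fields this sketch needs (assumption).
struct FakeArray {
    unsigned long cLocks = 0;
    void* pvData = nullptr;
};

// Stand-ins for SafeArrayLock and SafeArrayUnlock.
inline void fake_lock(FakeArray& a)   { ++a.cLocks; }
inline void fake_unlock(FakeArray& a) { assert(a.cLocks > 0); --a.cLocks; }

// Locks in the constructor and unlocks in the destructor, so the two calls
// stay paired even if the code in between returns early or throws.
class ArrayLock {
public:
    explicit ArrayLock(FakeArray& a) : a_(a) { fake_lock(a_); }
    ~ArrayLock() { fake_unlock(a_); }
    ArrayLock(const ArrayLock&) = delete;
    ArrayLock& operator=(const ArrayLock&) = delete;

    // The data pointer is meaningful only while the guard is alive.
    void* data() const { return a_.pvData; }

private:
    FakeArray& a_;
};
```

A guard like this suits short, scoped accesses. It does not by itself give the long-lived element references that std::vector clients expect.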

The automation_vector Template Class

A partial listing of the automation_vector class, and its base class, automation_vector_base, appears in Figure 2. (For brevity, member functions such as begin, operator[], and insert, which enable automation_vector to implement the interface of std::vector, are not shown here. But they are all implemented for class automation_vector.) This wrapper class is based on a fundamental assumption: although data is passed around in the very flexible form of VARIANTs containing SAFEARRAYs, we have knowledge at compile time about the type that should be contained in those arrays. This assumption stems from the typical way a COM method is implemented in C++. The method takes a VARIANT parameter and coerces [1] it to the expected type before performing any operations on it (if the data is not already in the desired format). If conversion succeeds, processing continues with the known format, otherwise an error is returned.

The problem is that although the programmer knows what the data type will be after coercion, the compiler doesn't know what type it is beforehand, and therefore cannot help with simpler semantics, static checking, and so on. For simple VARIANT types, this is not much of an annoyance. You can directly access the fields in the union contained within the VARIANT. But to manipulate arrays in this way is overkill, even if you use the functions in the API. The goal in building the wrapper class was to enable syntax and semantics like the following in the implementation of a COM method:

HRESULT ComObject::Method
   (/* in, out */ VARIANT * pVar)
{
   automation_vector<int> Data;
   // Move Var into the vector.
   // Coercion may occur.
   // No data copying is performed
   Data.attach(*pVar);

   ... use Data as you would use
       std::vector<int> ...

   // Move the vector back into the
   // result. No coercion or copying is performed.
   Data.detach(*pVar);
   return S_OK;
}

The attach and detach operations facilitate the bidirectional transformation between the plain VARIANT needed by the COM interface and the automation_vector needed to perform the actual calculations. Note that no actual data is copied. Rather, during attachment a pointer to the data is copied between a dynamically typed variable (the VARIANT passed to attach) and a member variable within a statically typed automation_vector. Also, if conversion between C++ and Automation types is necessary, the conversion is done in-place, if at all possible. (This requires that the sizes of the C++ type and Automation type match — which is not always the case.) The foregoing methods are essential for avoiding unnecessary copying of large arrays. Of course, in some cases a copy is needed, so such a constructor is also provided.

Implementing an automation_vector

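I can now identify the main problems to be solved in constructing an automation_vector wrapper: relating C++ types to Automation types, attaching to and detaching from data that already exists inside a VARIANT, and reconciling SAFEARRAY locking with the semantics of std::vector.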

Considering the above, the task of bridging one interface to another is anything but trivial. In the following discussion I address these issues one by one.

Relating C++ and Automation Types

A mapping between C++ types and the integral constants in the VARENUM enumeration must be defined. The problem is not trivial because, while the number of types supported by Automation is finite, the C++ types that can be represented by them are unbounded. This is because of the possibility of inheritance. Consider a VARIANT. It is common for frameworks to derive structures from VARIANT for enriched functionality. (It is fascinating to watch library writers scramble to discover the Holy Grail, the ultimate VARIANT wrapper.) For example, Microsoft provides COleVariant as part of MFC, CComVariant with ATL, and _variant_t, a member of the compiler COM support classes. Borland C++ Builder also defines two wrappers, OleVariant and Variant. You can build your own wrappers, and because all these derive from VARIANT (and don't change its binary format), you may legitimately want to store them in Automation arrays. In conclusion, the template class should accept not only VARIANTs, but also derivatives of VARIANT that don't add any data/virtuals to it. The same reasoning applies to the CURRENCY structure defined by Automation, or simple wrappers you might want to build around types like DATE (which is only a typedef for double). The constraining mechanism for types should be flexible enough to accept these cases. And as always, it would be best to enforce type safety at compile time.

To facilitate conversion from Automation to C++ types, I've put in place a template class that holds static information about Automation types. Its (non-type) template parameter is a VARENUM enumerated value. This template defines an enumeration [2], and a fixed-size array, as shown below:

template <VARENUM varenum>
struct static_variant_info
{
   enum { vt = varenum };
   // The size of this array equals the size of the Automation type.
   // The 4-byte pointer entries assume a 32-bit target.
   static char size_checker[
   varenum == VT_I1 ? 1
   : varenum == VT_I2 ? 2
   : varenum == VT_I4 ? 4
   : varenum == VT_R4 ? 4
   : varenum == VT_R8 ? 8
   : varenum == VT_CY ? 8
   : varenum == VT_DATE ? 8
   : varenum == VT_BSTR ? 4
   : varenum == VT_DISPATCH ? 4
   : varenum == VT_UNKNOWN ? 4
   : varenum == VT_VARIANT ? 16
   : 0 ];
};

This template collects together the Automation type tag (the enumerated VARENUM value) and the size of that type, calculated at compile time with a chain of ?: operators. For instance, static_variant_info<VT_I4>::vt evaluates to VT_I4, or 3 (see the VARENUM definition above), and sizeof(static_variant_info<VT_I4>().size_checker) evaluates to 4.

The above template addresses half the problem. I also need a way to go from C++ types to VT_XXs. The binding is done through function signatures, as shown below. (Attention: non-intuitive code ahead!)

namespace Configure
{
   static_variant_info<VT_I1>
      deduceVARENUM(char);
   static_variant_info<VT_I1>
      deduceVARENUM(signed char);
   static_variant_info<VT_I1>
      deduceVARENUM(unsigned char);
   static_variant_info<VT_I2>
      deduceVARENUM(short);
   static_variant_info<VT_I2>
      deduceVARENUM(unsigned short);
   static_variant_info<VT_I4>
      deduceVARENUM(int);
   static_variant_info<VT_I4>
      deduceVARENUM(unsigned int);
   static_variant_info<VT_I4>
      deduceVARENUM(long);
   static_variant_info<VT_I4>
      deduceVARENUM(unsigned long);
   static_variant_info<VT_R4>
      deduceVARENUM(float);
   static_variant_info<VT_R8>
      deduceVARENUM(double);
   static_variant_info<VT_CY>
      deduceVARENUM(CURRENCY);
   static_variant_info<VT_BSTR>
      deduceVARENUM(BSTR);
   static_variant_info<VT_DISPATCH>
      deduceVARENUM(IDispatch *);
   static_variant_info<VT_UNKNOWN>
      deduceVARENUM(IUnknown *);
   static_variant_info<VT_VARIANT>
      deduceVARENUM(VARIANT);
   static_variant_info<VT_VARIANT>
      deduceVARENUM
         (automation_vector_base);
}

For each C++ type, there is a function that returns a static_variant_info parameterized with the corresponding Automation tag. Now, if from inside a template having T as a template parameter, the expression

Configure::deduceVARENUM(T()).vt

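is evaluated, the compiler will select the deduceVARENUM overload that best matches T, determine that overload's return type (a static_variant_info parameterized with the corresponding tag), and read the vt enumerator of that type.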

The static_variant_info struct provides a means to translate C++ types to VT_ values. For classes derived from VARIANT, deduceVARENUM(T()) will resolve to deduceVARENUM(VARIANT). This is a result of the overloading rules that come into play during this process. Now say you define a new wrapper class COMDate that wraps the DATE type. To enable storage of COMDate values in an automation_vector, all you have to do is reopen namespace Configure and insert the appropriate function signature, like this:

namespace Configure
{
   static_variant_info<VT_DATE>
      deduceVARENUM(COMDate);
}
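The overload-based mapping can be tried outside a COM project with a reduced version such as the one below. All names in it (Tag, Variant, MyVariant, ComDate, tag_info, deduce, tag_of) are hypothetical stand-ins for VARENUM, VARIANT, static_variant_info, and deduceVARENUM, and it uses decltype, which was not available when the technique was devised, so take it as a sketch of the idea rather than the article's code.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in tags and types (assumption: real code uses the Automation headers).
enum Tag { TAG_I4 = 3, TAG_R8 = 5, TAG_DATE = 7, TAG_VARIANT = 12 };
struct Variant   { unsigned short vt; unsigned short reserved[3]; double payload; };
struct MyVariant : Variant {};       // adds no data and no virtuals
struct ComDate   { double value; };  // thin wrapper around a DATE-like double

template <Tag tag>
struct tag_info {
    enum { vt = tag };
    // Array whose size is the Automation-side size of the tagged type.
    static char size_checker[
        tag == TAG_I4 ? 4 : tag == TAG_R8 ? 8 : tag == TAG_DATE ? 8 : 16];
};

// Declarations only: the functions are never called, so they need no bodies.
tag_info<TAG_I4>      deduce(int);
tag_info<TAG_R8>      deduce(double);
tag_info<TAG_VARIANT> deduce(Variant);
tag_info<TAG_DATE>    deduce(ComDate);

template <class T>
struct tag_of {
    // decltype and sizeof do not evaluate their operand, so deduce(T()) is
    // only overload-resolved, never executed.
    static constexpr int value = decltype(deduce(T()))::vt;
    static constexpr std::size_t automation_size =
        sizeof(decltype(deduce(T()))::size_checker);
};
```

Here tag_of<MyVariant>::value equals TAG_VARIANT because the derived-to-base conversion is the only viable match, which mirrors how classes derived from VARIANT resolve to deduceVARENUM(VARIANT).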

It is possible to check at compile time whether the automation_vector you've created is parameterized for an Automation-compatible type. It involves use of the static member variable static_variant_info::size_checker and a small template called static_checker. The latter is a template that evaluates to nothing if its template parameter is true, but produces a compile or link error otherwise. (For more information on static_checker, see the sidebar.)

The following line from the class definition of automation_vector checks whether the type represented by T is the correct size:

// If you have an error on the line
// below, you've instantiated
// automation_vector with the wrong type
static_checker<sizeof(T) ==
    sizeof(Configure::deduceVARENUM(T()).size_checker)>();

This compile-time assertion enforces the size requirement. For instance, although COleVariant and COleSafeArray are both derived from VARIANT, the latter will fail the test above, causing a compile-time error. I find it interesting that it's not the bodies of the deduceVARENUM functions that matter, only their signatures and return values. Like empty structures, they serve only as a means to leverage C++'s type system.
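For readers without the sidebar at hand, one plausible shape for such a checker is shown below. This is a guess at the general technique, not the article's actual static_checker; FourBytes and checked_element are hypothetical names used only to exercise it.

```cpp
#include <cassert>

// Only the `true` case is defined. Creating a static_checker<false> object
// refers to an incomplete type and therefore fails to compile.
template <bool> struct static_checker;
template <> struct static_checker<true> {};

struct FourBytes { char bytes[4]; };

template <class T, unsigned AutomationSize>
struct checked_element {
    checked_element() {
        // Compiles only when the two sizes agree.
        static_checker<sizeof(T) == AutomationSize>();
    }
    // C++11 and later can state the same requirement directly.
    static_assert(sizeof(T) == AutomationSize, "T has the wrong size");
};
```

Instantiating checked_element<FourBytes, 4> compiles cleanly, while checked_element<FourBytes, 8> would be rejected by the compiler.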

Attachment and Detachment

You can initialize an automation_vector the same way as a std::vector, using the constructor that takes a size and a fill element:

// Create a 10-element vector,
// each containing a real number
automation_vector<CComVariant> Array(10, CComVariant(0.0));

However, a major difference between automation_vector and std::vector is that the contents of the automation_vector are not always created from within C++ code. The data can be created outside the COM method and passed to it as a VARIANT. (Don't forget: a SAFEARRAY always lives in a VARIANT.) Thus, in some cases the elements of the vector must be thought of as already constructed, by another entity that transcends the C++ method (like Visual Basic). What's more, there are cases when you don't want to destroy the contents of the automation_vector, because the data will be passed outside the method through an output parameter.

For these reasons, automation_vectors are peculiar objects whose lifetimes must be carefully controlled. This issue can be seen as another facet of relating C++ types to those defined by a binary standard. When the automation_vector is constructed, the objects that will make up its elements already exist in a binary form, which may or may not fit the requirements imposed by the C++ type. A blind conversion copy would solve the problem because it would rely on constructor invocations, but this solution would be inefficient for large arrays. A more flexible scheme is necessary.

The scheme developed for automation_vector involves a process known as attachment. The following example illustrates this process.

Suppose you define a class MyClass that is binary compatible with VARIANT. The automation_vector constructor invoked within the method below makes the contents of the VARIANT available to the automation_vector:

HRESULT ComObject::Method(/* in, out */ VARIANT * pVar)
{
   typedef automation_vector<MyClass> TMyArray;
   TMyArray Array(*pVar, TMyArray::MOVE);
   ...
}

The automation_vector constructor is shown in Figure 3. MOVE is an enumeration that tells the constructor that the data in *pVar is not to be brought into the automation_vector by a copying process, but by a "move" process, which does not really move any data — only a pointer to the data. Moving is done through a call to function automation_vector::attach, which also appears in Figure 3.

If the automation_vector's element type has a binary layout compatible with the Automation type, the if (vt == myVARENUM()) conditional within attach will evaluate to true. Within the if clause, a call is made to the automation_vector_base class's attach function, which also appears in Figure 3. Attachment is, at the core, a process of swapping data pointers between an automation_vector and a VARIANT. This happens in the base class's attach function.
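The pointer swap at the heart of attachment can be pictured with the reduced model below. It is only a sketch: FakeArray, FakeVariant, and vector_base_sketch are hypothetical stand-ins for SAFEARRAY, VARIANT, and automation_vector_base, and the real functions also deal with coercion, locking, and error handling.

```cpp
#include <cassert>
#include <utility>

// Stand-ins for SAFEARRAY and VARIANT (assumption: only the fields used here).
struct FakeArray   { int* data; unsigned long count; };
struct FakeVariant { unsigned short vt; FakeArray* parray; };

const unsigned short kEmpty     = 0;        // plays the role of VT_EMPTY
const unsigned short kArrayOfI4 = 0x2003;   // plays the role of VT_ARRAY | VT_I4

class vector_base_sketch {
public:
    vector_base_sketch() : self_{kEmpty, nullptr} {}

    // Swap tag and pointer with the caller's variant; no element is copied.
    void attach(FakeVariant& v) { std::swap(self_, v); }

    // Hand the array back and leave the vector empty. Real code would first
    // have to release whatever v held before overwriting it.
    void detach(FakeVariant& v) {
        v = self_;
        self_ = FakeVariant{kEmpty, nullptr};
    }

    unsigned long size() const { return self_.parray ? self_.parray->count : 0; }
    int* data() const { return self_.parray ? self_.parray->data : nullptr; }

private:
    FakeVariant self_;   // the vector's own tag-and-pointer pair
};
```

After attach, the caller's variant is left empty and the vector owns the array, so the cost is independent of the number of elements.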

Even though types may match closely enough to satisfy the vt == myVARENUM() conditional, they may still be semantically incompatible. Suppose you had a PositiveInteger class with a single int as a member, and some accessors. It guarantees that its value is positive. Although it is binary compatible with Automation's four-byte integer type (denoted by VARENUM value VT_I4), it has to watch out that Automation doesn't hand it a negative integer. If it does receive one, it will have to take some special action, such as throwing an exception.

For this reason I've defined two coercion functions that give the automation_vector client a chance to coerce the Automation values to C++ values during the attachment process. For the type MyClass, the functions should have the following prototypes:

void from_automation(SAFEARRAY & Array, MyClass * pDummy);
void to_automation(SAFEARRAY & Array, MyClass * pDummy);

The first parameter is an array, and the second is a null pointer that is not actually used for any other purpose than type selection. The bottom line is, if you have a type that is binary compatible in terms of size, but not compatible in terms of semantics or bit layout, you have to write the two functions above.

In the common case no special action is necessary, so I've defined the default functions below:

inline void from_automation(SAFEARRAY &, void *)
{
}
inline void to_automation(SAFEARRAY &, void *)
{
}

If no other from_automation or to_automation functions are defined, the standard conversions come into action and these two functions are considered by the compiler. (A pointer to an object can be automatically converted to a pointer to void.) This arrangement provides both good defaults and flexibility.
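The dispatch between the default hook and a type-specific one can be checked with a small stand-alone program. In the sketch below, IntArray is a hypothetical stand-in for SAFEARRAY, coerce_in is an invented helper that plays the role of the wrapper's internal call, and the choice to throw std::range_error is just one possible "special action."

```cpp
#include <cassert>
#include <stdexcept>

struct IntArray { int* data; unsigned long count; };  // stand-in for SAFEARRAY

struct PositiveInteger { int value; };

// Default hook: any T* converts to void*, so this is the fallback.
inline void from_automation(IntArray&, void*) {}

// Specific hook: an exact match on PositiveInteger* beats the conversion
// to void*, so this overload wins for PositiveInteger elements.
inline void from_automation(IntArray& a, PositiveInteger*) {
    for (unsigned long i = 0; i < a.count; ++i)
        if (a.data[i] <= 0)
            throw std::range_error("value is not positive");
}

// The typed null pointer carries no data; it only selects the overload.
template <class T>
void coerce_in(IntArray& a) {
    from_automation(a, static_cast<T*>(nullptr));
}
```

Calling coerce_in<int> on any array does nothing, while coerce_in<PositiveInteger> validates every element and rejects an array that contains a non-positive value.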

What if the C++ and Automation types aren't compatible at all? Then execution follows the else clause in function automation_vector::attach. An example would be transforming a vector of strings (VT_BSTR) to a vector of PositiveIntegers (VT_I4). This situation requires much more corrective action than simply a call to from_automation. Now conversion becomes a two-step process. First, each string is converted to a COM integer using standard Automation conversions; the result is stored in a temporary Automation vector. Then the process starts all over again by attaching to that temporary vector. This time from_automation will be called to "transform" the raw COM integers into PositiveIntegers. Of course, this "totally incompatible" case does involve copying, not just a pointer swap — it can't be avoided.

There's not much to say about detachment. You call detach, passing it a VARIANT reference. The pointer to the automation_vector data is moved into the VARIANT and then the vector's copy of the VARIANT (which just holds a tag and a pointer) is cleared. The function to_automation is called during this process.

Locking and Unlocking

As I said before, the locking issue dramatically changes the semantics of SAFEARRAY when compared to std::vector. The procedure for accessing an element is: lock the array, get or set the element, and unlock the array. This is quite different from operator[] in std::vector, which returns a reference that can be modified later. Fortunately, as hairy as the issue is, the solution is simple: I keep the array locked at all times. While locked, the pvData member of the SAFEARRAY structure holds a pointer to the actual data, and I rely on casts for accessing the elements. This is not as simple as it sounds. Remember that vectors can be constructed outside of C++ and can be part of other vectors. To get a handle on locking and unlocking, I took advantage of the from_automation and to_automation functions described above: I defined them for automation_vector<T> to lock and unlock the SAFEARRAY, respectively. In other words, automation_vector is the first client of its own coercion mechanism. By using this mechanism, you can have automation_vectors containing automation_vectors and so on. This was essential in our first application using automation_vectors, because we needed matrices.

Conclusion

The introduction of automation_vector as the workhorse of our project has had a positive impact on productivity. Matrix-intensive code was reduced in size by a factor of two, while becoming much easier to read. For this reason, although the implementation has been a tough endeavor (I remember I was about to scrap it twice), it was well worth the effort. I've tried to share this implementation with you so you won't have to pass between the Scylla and Charybdis of wrapping a SAFEARRAY again. (Bug reports and suggestions are welcome.) I recommend use of automation_vector whenever you have to deal with single-dimensional Automation arrays. Besides, even if you work with other binary standards such as CORBA, I hope the techniques I've presented will inspire you. They apply whenever you have to communicate between C++ and a world that has a slightly different notion of what a type means.

Acknowledgements

Thanks to Peter Marino for initiating me into SAFEARRAY's intricacies. Thanks to Marc Briand and Scott Meyers, whose extensive feedback convinced me to scrap the first version of this article and rewrite it from scratch.

Notes

[1] In the context of this article, coercion means "in-place conversion." An element of a certain Automation type is altered into a C++ compatible type. The layout of the data may change, or certain semantic constraints may be applied to the data, but the location and amount of storage it occupies will not change.

[2] Technically, there should be a static constant in there, not an enum, but MSVC doesn't like the constants. I had to stick with "the enum hack."

Andrei Alexandrescu is a developer with Micro Modeling Associates, Inc's New York Component Solutions Group. He is responsible for application development using Visual C++, ActiveX technology, Visual SourceSafe, ODBC, and SQL Server. He may be reached at andrei@metalanguage.com.
