Data versioning with Google Protocol Buffers (GPB)

For the long-term maintainability, interoperability, and extensibility of application data, it is recommended that applications version the data they write using the different coordination services. Google Protocol Buffers (GPB) is the versioning mechanism recommended for these services and supported by the SDK. The following section introduces GPB and its use for message versioning with an application’s model objects. Readers should consult the official GPB documentation for the complete syntax and the features available in the programming language chosen for the application. Application developers who wish to use GPB in their designs will need to download and install GPB on their local development machine.

GPB is a strongly-typed Interface Definition Language (IDL) with many primitive data types. It also allows for composite types and namespaces through packages. Users define the type of data they wish to send/store by defining a protocol file (.proto) that defines the field names, types, default values, requirements, and other metadata that specifies the content of a given record. [50, 51]

Versioning is controlled in the .proto IDL file through a combination of field numbers and tags (REQUIRED/OPTIONAL/REPEATED). These tags designate which of the named fields must be present in a message to be considered valid. There are well-known rules of how to design a .proto file definition to allow for compatible versions of the data to be sent and received without errors (see Versioning Rules section that follows).

From the protocol file, the GPB compiler (protoc) generates the data access classes for the user’s language of choice (Java in this guide). The generated GPB class provides field accessors and builder methods for the application to interact with the data. The compiler flags syntax and semantic errors in the definition, such as a field number used twice within a message. It checks one definition at a time, so compatibility with earlier versions of a message still depends on following the versioning rules.

The application itself works with a Model Object that it defines and maps to the GPB class that will be distributed. The conversion between Model Object and GPB object takes place in a custom serializer, which the programmer must write and register with the Coordination Service to bridge the object’s use in the application and its distribution over the Coordination Services (see the Application GPB usage section that follows for more details).

Below is an example of a GPB .proto file that defines a Person by their contact information and an AddressBook by a list of Persons. This example demonstrates the features and syntax of a GPB message. The string and int32 types are just two of the 15 definable data types (including enumerated types), which are similar to existing Java primitive types. Each field requires a tag, type, name, and number to be valid. Default values are optional. Message structures can be composed of other messages. In this example, a name and an id are the minimum fields required to make up a valid Person record, and a number is the minimum required for each PhoneNumber it contains. If this were version 1 of the message, then version 2 could, for example, add an optional string website = 5; field to expand the record without breaking compatibility with version 1 of the Person record. The AddressBook message defines a composition of this Person message to hold a list of people using the repeated tag.

		message Person {
		    required string name = 1;
		    required int32 id = 2;
		    optional string email = 3;
		
		    enum PhoneType {
		        MOBILE = 0;
		        HOME = 1;
		        WORK = 2;
		    }
		
		    message PhoneNumber {
		        required string number = 1;
		        optional PhoneType type = 2 [default = HOME];
		    }
		
		    repeated PhoneNumber phone = 4;
		}
		
		message AddressBook {
		    repeated Person person = 1;
		}

The protocol file above would be run through protoc (see Compilation process for .proto files) to generate the Java data access classes that represent these messages. Message builders allow new instances of the message to be created for distribution by the Coordination Services. Normal set/get accessor methods are also provided for each field. Below is an example of creating a new instance of the message in Java. Reading the record back returns this GPB-generated object for the application to interact with as usual.

// Person is the class generated by protoc from the .proto file above;
// import it from the package and outer class that protoc produced.
public class AddPerson {

    // Creates a simple instance of a GPB Person object
    // that can then be written to one of the Coordination Services.
    public static Person createTestPerson() {

        // Initial GPB instance builder.
        Person.Builder person = Person.newBuilder();

        // Set REQUIRED Person fields.
        person.setName("John Doe");
        person.setId(1234);   // id is declared as int32 in the .proto file.

        // Set OPTIONAL Person fields.
        person.setEmail("john.doe@gmail.com");

        // Set REQUIRED Phone fields.
        Person.PhoneNumber.Builder phoneNumber =
                Person.PhoneNumber.newBuilder().setNumber("555-555-5555");

        // Set OPTIONAL Phone fields.
        phoneNumber.setType(Person.PhoneType.MOBILE);

        person.addPhone(phoneNumber);

        // build() fails if a REQUIRED field has not been set.
        return person.build();
    }
}

GPB versioning rules

A message version is a function of the field numbering and tags provided by GPB and how those are changed between different iterations of the data structure. The following are general rules about how .proto fields should be updated to ensure compatible GPB versioned data:

  • Do not change the numeric tags for any existing (previous version) fields.

  • New fields should be tagged OPTIONAL/REPEATED (never REQUIRED). New fields should also be assigned a new, unique field ID.

  • Removing OPTIONAL/REPEATED tagged fields is allowed and does not affect compatibility, provided the number of a removed field is never reused for a different field.

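  • Changing a default value for a field is allowed. (Default values are never sent on the wire. When a field is absent, the receiver substitutes the default from its own version of the .proto file, so a sender and a receiver built from different versions may see different defaults.)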

  • There are specific rules for changing the field types. Some type conversions are compatible while others are not (see GPB documentation for specific details).

Note: It is generally advised that the minimal number of fields be marked with a REQUIRED tag as these fields become fixed in the schema and will always have to be present in future versions of the message.
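
The rules above follow from how GPB encodes a message: each field is written as a key holding its field number and wire type, followed by the value, and a reader skips field numbers it does not recognize. The following is a minimal sketch of that idea, not the GPB library itself. The WireFormatSketch class, its method names, and the website field are hypothetical, and only the varint and length-delimited wire types are handled. Applications should always use the classes generated by protoc.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: a hand-rolled subset of the GPB wire format.
class WireFormatSketch {
    static final int VARINT = 0;            // wire type used by int32
    static final int LENGTH_DELIMITED = 2;  // wire type used by string

    // Base-128 varint: 7 payload bits per byte, high bit set when more bytes follow.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static long readVarint(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // The key of every field is (field number << 3) | wire type.
    static void writeInt32(ByteArrayOutputStream out, int fieldNumber, int value) {
        writeVarint(out, (fieldNumber << 3) | VARINT);
        writeVarint(out, value);
    }

    static void writeString(ByteArrayOutputStream out, int fieldNumber, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (fieldNumber << 3) | LENGTH_DELIMITED);
        writeVarint(out, bytes.length);
        out.write(bytes, 0, bytes.length);
    }

    // A "version 2" writer: name = 1, id = 2, plus a new optional website = 5.
    static byte[] encodeV2(String name, int id, String website) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, name);
        writeInt32(out, 2, id);
        writeString(out, 5, website);
        return out.toByteArray();
    }

    // A "version 1" reader: knows only fields 1 and 2 and skips anything else.
    static String decodeV1(byte[] data) {
        int[] pos = {0};
        String name = null;
        long id = 0;
        while (pos[0] < data.length) {
            long key = readVarint(data, pos);
            int fieldNumber = (int) (key >>> 3);
            int wireType = (int) (key & 0x7);
            if (wireType == VARINT) {
                long value = readVarint(data, pos);
                if (fieldNumber == 2) {
                    id = value;
                }
            } else if (wireType == LENGTH_DELIMITED) {
                int length = (int) readVarint(data, pos);
                if (fieldNumber == 1) {
                    name = new String(data, pos[0], length, StandardCharsets.UTF_8);
                }
                pos[0] += length;  // Unknown fields are stepped over, not rejected.
            } else {
                throw new IllegalArgumentException("Wire type not handled in this sketch: " + wireType);
            }
        }
        return "name=" + name + ", id=" + id;
    }

    public static void main(String[] args) {
        byte[] v2 = encodeV2("John Doe", 1234, "example.org");
        System.out.println(decodeV1(v2));  // prints "name=John Doe, id=1234"
    }
}
```

Because the reader matches on field numbers only, a new optional field passes through an older reader unnoticed, while renumbering an existing field would make the older reader miss it entirely.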

Compilation process for .proto files

The following describes how .proto files should be defined for an application and compiled with the GPB compiler, and how the derived data classes should be imported and used in application code. The description applies to GPB version 2.5.0:

Compiling and installing the protoc binary

The protoc binary is the tool used to compile your text-based .proto file into a source file based on the language of your choice (Java in this example). You will need to follow these steps if you plan on being able to compile GPB-related code.

  1. Download the full source of Google's Protocol Buffers. The instructions below use version 2.5.0.

  2. Extract it somewhere locally.

  3. Execute the following commands:

    cd protobuf-2.5.0

    ./configure && make && make check && sudo make install

  4. Add the following command to your shell profile, then execute the command:

    export LD_LIBRARY_PATH=/usr/local/lib

  5. Run protoc on its own to verify that it is in your path and that LD_LIBRARY_PATH is set correctly. Running protoc with no arguments should print Missing input file. if everything is set up correctly.

Compiling .proto files

We recommend placing .proto files under the /src/main/proto directory of the project in which you define and use GPB. You can then use the GPB java_package option to control the subdirectory/package structure created for the Java code generated from the .proto file.

The project’s pom.xml file requires the following GPB-related entries:

		    <dependencies>
		        <dependency>
		            <groupId>com.google.protobuf</groupId>
		            <artifactId>protobuf-java</artifactId>
		            <version>2.5.0</version>
		        </dependency>
		    </dependencies>
		
		    <build>
		        <plugins>
		            <plugin>
		                <groupId>org.apache.maven.plugins</groupId>
		                <artifactId>maven-compiler-plugin</artifactId>
		                <version>2.3.2</version>
		                <configuration>
		                    <source>1.7</source>
		                    <target>1.7</target>
		                </configuration>
		            </plugin>
		 
		            <plugin>
		                <groupId>com.google.protobuf.tools</groupId>
		                <artifactId>maven-protoc-plugin</artifactId>
		                <version>0.3.2</version>
		                <executions>
		                    <execution>
		                        <goals>
		                            <goal>compile</goal>
		                        </goals>
		                    </execution>
		                </executions>
		            </plugin>
		        </plugins>
		    </build>

Running mvn clean install against the pom.xml file runs GPB’s protoc as part of the build, which will:

  • Generate the necessary Java files under the following directory:

    ./target/generated-sources/protobuf/java/<optional java_package directory>

  • Compile the generated Java file into class files

  • Package up the class files into a jar in the target directory

  • Install the compiled jar into your local Maven cache (~/.m2/repository)

To have the .proto file and the generated .java file displayed properly in your IDE, execute the following from your project’s root directory (where the project’s pom.xml file is):

  • mvn eclipse:clean

  • mvn eclipse:eclipse

  • Refresh the project in your IDE (Optional: clean the project as well).

Because the resulting Java file is code generated by protoc, it is not recommended that it be checked in to your source code management repository. It should instead be regenerated when the application is built. The GPB Java Tutorial on the official GPB website gives a more in-depth walkthrough of the resulting Java class.

Application GPB usage

Generated GPB message classes are meant to serve as the versioned definition of data distributed by the Coordination Service. They are not meant to be used directly by the application to read/write to the various Coordination Services. It is recommended that a Model Object be defined for this role. This scheme provides two notable benefits:

  1. It allows the application to continue to evolve without concern for the data versioning at the Coordination Service level.

  2. It allows the Model Object to define fields for data that it stores and uses locally for a version of the data, without sharing that data during distribution.

The recommended procedure for versioning Coordination Service data is shown below and the sections that follow explain each of these steps with examples and best practices.

  1. Define a POJO Model Object for the data that the application will want to operate on and distribute via a Coordination Service.

  2. Define a matching GPB .proto Message to specify which field(s) of the Model Object are required/optional for a given version of message distributed by the Coordination Services.

  3. Implement and register a Custom Serializer with the Coordination Service that will convert the Model Object the application uses to the GPB message class that will be distributed.

Model objects

The application developer defines POJOs (plain old Java objects) for the application. They contain the data and methods necessary for the application’s processing and may contain data that the application wishes to distribute to other members of the controller team. Not all fields need to be distributed. The only requirement on the Model Object’s implementation is that a class written to the different Coordination Services implement com.hp.api.Distributable (a marker interface) to make it compatible with the Coordination Service.

In terms of sharing these objects via the Coordination Service, the application developer should consider which fields are required to constitute a version of the Model Object and which fields are optional. Commonly, the fields that appear as arguments of the object’s constructor can be considered required fields for a version of the object. Later versions may add optional fields that are not set by a constructor. New required fields may be added in new versions of the Model Object by making them arguments of a new constructor. Note that a newly added required field must then be present in all future versions. Past versions of the application that receive a new required field will just ignore it. Overall, thinking in terms of which fields are optional or required will help with the next step, the definition of the GPB .proto message.

The following is an example of a Person Java class that an application may want to define and distribute via a PubSub Message Bus. The name and id fields are the only required fields, as indicated by the constructor arguments. The application may use other ways to indicate which fields are required.

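import java.util.Date;

import com.hp.api.Distributable;

public class Person implements Distributable {
    // Required fields, set through the constructor.
    private String name;
    private int id;

    // Optional field.
    private String email;

    // Local-only field; not part of the distributed message.
    private Date lastUpdated;

    public Person(String name, int id) {
        this.name = name;
        this.id = id;
    }
    // Accessor and other methods.
}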
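
As an illustration of how such a class might evolve, the sketch below adds an optional field in a hypothetical second version. The PersonV2 name, the website field, and the touch() method are assumptions made for this example. The local Distributable interface is only a stand-in so that the sketch compiles without the SDK; in an application the class would implement com.hp.api.Distributable.

```java
import java.util.Date;
import java.util.Objects;

// Stand-in for the SDK marker interface com.hp.api.Distributable (assumption,
// declared here only so the sketch compiles on its own).
interface Distributable { }

// Hypothetical second version of the Person Model Object.
class PersonV2 implements Distributable {
    private final String name;   // required since version 1
    private final int id;        // required since version 1
    private String email;        // optional since version 1
    private String website;      // optional, new in version 2 (hypothetical)
    private Date lastUpdated;    // local only, never distributed

    // Required fields are the constructor arguments, unchanged from version 1.
    PersonV2(String name, int id) {
        this.name = Objects.requireNonNull(name, "name is required");
        this.id = id;
    }

    String getName() { return name; }
    int getId() { return id; }

    // Optional fields are set after construction and may stay null.
    String getEmail() { return email; }
    void setEmail(String email) { this.email = email; }
    String getWebsite() { return website; }
    void setWebsite(String website) { this.website = website; }

    Date getLastUpdated() { return lastUpdated; }
    void touch() { this.lastUpdated = new Date(); }
}
```

Keeping the constructor signature unchanged means that code written against the first version still compiles, and the new field remains optional in the matching .proto message.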

GPB .proto message

The GPB .proto message serves as the definition of a versioned message to be distributed by the Coordination Service. The application developer should write the .proto messages with the Model Object in mind when considering the data type of each field, whether it is optional or required, and so on. The developer should consider all the GPB versioning rules and best practices mentioned in the previous section. The programmer implements one message per Model Object that will be distributed, following the GPB rules and conventions previously discussed.

Below is an example .proto message for the Person class. The field data types and REQUIRED/OPTIONAL tags match the Model Object. Since email was not a field to be set in the constructor it is marked optional while name and id are marked as required. Notice that the lastUpdated field of the Model Object is not included in the .proto message definition. This is considered a transient field, in the serialization sense, for the Model Object and it is not meant to be distributed in any version of the message. With this example the reader can see not all fields in the Person Model Object must be defined and distributed with the .proto message.

		option java_outer_classname = "PersonProto";  // Wrapper class name.

		message Person {
		    required string name = 1;
		    required int32 id = 2;
		    optional string email = 3;
		}

The application developer then runs protoc, as described in the Compilation process for .proto files section above, to generate the matching wrapper and builder classes, which give the application a Java class that defines the message.

Custom serializer

Finally, a custom serializer needs to be defined to translate between instances of the Model Object being used in the Coordination Services and instances of the GPB message that will ultimately be transported by that service. For example, we may wish to write the Person Model Object on the PubSub Message Bus and have it received by another instance of the application which has subscribed to Person messages through its local Coordination Service.

In the custom serializer the developer maps the fields between these two objects on transmit (serialization) and receive (deserialization). The data types and naming conventions should make this 1:1 mapping clear. The serializer must implement the Serializer<Model Object> interface as shown in the example below. It is recommended that this serializer be kept in the <application>-bl project (if using the application project generation script provided with the SDK). PersonProto is the java_outer_classname defined in the GPB message above and is the outer class within which the inner GPB message classes and their builders are defined.

import com.google.protobuf.InvalidProtocolBufferException;

import <your package>.PersonProto;

public class PersonSerializer implements Serializer<Person> {

    @Override
    public byte[] serialize(Person subject) {
        PersonProto.Person.Builder message = PersonProto.Person.newBuilder();
        message.setName(subject.getName());
        message.setId(subject.getId());
        // email is OPTIONAL, so set it only when the Model Object has one.
        if (subject.getEmail() != null) {
            message.setEmail(subject.getEmail());
        }
        return message.build().toByteArray();
    }

    @Override
    public Person deserialize(byte[] serialization) {
        PersonProto.Person message;
        try {
            message = PersonProto.Person.parseFrom(serialization);
        } catch (InvalidProtocolBufferException e) {
            // Handle the error; an unparsable message yields no object.
            return null;
        }

        // name and id are REQUIRED, so they go through the constructor.
        Person newPerson = new Person(message.getName(), message.getId());
        if (message.hasEmail()) {
            newPerson.setEmail(message.getEmail());
        }
        return newPerson;
    }
}
		

In the serialize() method, the builder pattern of the generated GPB message class is used to create a GPB version of the Person Model Object. After the proper fields are set, the message is built and converted to a byte array for transport. In the deserialize() method on the receiver, the byte array is converted back to the expected GPB message object. An instance of the Model Object is then created and returned to be placed into the Coordination Service for which the serializer is registered.
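
A round-trip check is a cheap way to confirm the field mapping in a serializer. The sketch below is self-contained, so it uses stand-ins that are not part of the SDK or GPB: a local Serializer interface with the two methods described above, a hypothetical Contact Model Object, and java.io data streams in place of the generated GPB builder. Only the test pattern is meant to carry over to a real serializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Date;

// Stand-in for the SDK Serializer interface (assumption, for this sketch only).
interface Serializer<T> {
    byte[] serialize(T subject);
    T deserialize(byte[] serialization);
}

// Hypothetical Model Object with two distributed fields and one local field.
class Contact {
    final String name;
    final int id;
    Date lastUpdated;  // local only, never serialized

    Contact(String name, int id) {
        this.name = name;
        this.id = id;
    }
}

// Uses java.io data streams as a stand-in encoding; a real serializer
// would use the GPB classes generated by protoc instead.
class ContactSerializer implements Serializer<Contact> {
    @Override
    public byte[] serialize(Contact subject) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(subject.name);
            out.writeInt(subject.id);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    @Override
    public Contact deserialize(byte[] serialization) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialization))) {
            String name = in.readUTF();
            int id = in.readInt();
            return new Contact(name, id);
        } catch (IOException e) {
            return null;  // Unreadable bytes yield no object.
        }
    }
}

class RoundTripCheck {
    public static void main(String[] args) {
        Contact original = new Contact("John Doe", 1234);
        original.lastUpdated = new Date();

        ContactSerializer serializer = new ContactSerializer();
        Contact copy = serializer.deserialize(serializer.serialize(original));

        // Distributed fields survive the round trip; the local field does not.
        System.out.println(copy.name + " " + copy.id + " " + copy.lastUpdated);
        // prints "John Doe 1234 null"
    }
}
```

The same check against a real serializer would also confirm that optional fields left unset on the sender arrive unset on the receiver.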

The application must register this custom serializer with each Coordination Service on which it wishes to use this Model Object and GPB message combination. Below is an example of that registration in an OSGi component of an example application.

@Reference(cardinality = ReferenceCardinality.MANDATORY_UNARY,
           policy = ReferencePolicy.DYNAMIC)
protected volatile CoordinationService coordinationSvc;

@Activate
public void activate() {
    // Register Message Serializers
    if (coordinationSvc != null) {
        coordinationSvc.registerSerializer(new PersonSerializer(), Person.class);
    }
}