DataRaccoon

Problem Intro


Typically, JSON format is used to send data across systems. If you are in python land, you probably do things in a dictionary, serialize/unserialize it with json.dumps and json.loads respectively. If you are using a webframework such as flask, you would use jsonify and the get_data() method.

For example,

import json
output = {'account_id': 1234, 'sales_amount': 40.0}
json_output = json.dumps(output)
type(json_output) #str
print(json_output)#{"account_id": 1234, "sales_amount": 40.0}

##### Using Flask #### 

from flask import Flask
from flask import jsonify

app = Flask(__name__)

with app.app_context():
    json_output = jsonify(output)

print(json_output) #<Response 40 bytes [200 OK]>
print(json_output.get_data()) #b'{"account_id":1234,"sales_amount":40.0}\n'

input = (json_output.get_data())
print(json.loads(input)) #{'account_id': 1234, 'sales_amount': 40.0}

Problem 1

Using JSON poses quite a few challenges, the sending/recieving party does not need to enforce the data types, making it hard to parse further downstream. For instance,

  • account_id can be in string format "1234"
  • sales_amount can be int 40

(sounds familar?)

Problem 2

Among us data folks, we usually referrence the dictionary with keys. e.g data.get('sales_amount'). What if additional fields is required, or change schema naming from sales_amount to total_sales. Then downstream code referrence such as data.get('sales_amount') would fail.

In production ML systems making predictions, changing the schema naming convention would yield different results. This results in release cycles getting more complicated and systems being more tightly coupled.

Problem 3

Furthermore, in a typical data lake (which are very expensive to build and maintain), everything is usually dumped as json blobs no consistency, resulting in a significant technical debt, and data cleaning will definitely be painful. It is no wonder - data scientist spend more than 80% of their time cleaning data.

Ouch!

Introducing protobuf


Lets come up with a mock problem, that we are trying to send / parse the following data:

fields data type
account_id int
sales_amount float
method enum

For those unfamiliar, enum is factor variables. For example, in this case, the method can only take values of CASH, CREDIT_CARD, WALLET.

You can read the google guide on protobufs1. Essentially, protobuf generates a python module/script from a proto file.

In this example, we define events.proto which you can use protoc to generate events_pb2.py.

Note, if you are interested in following along, you may go through the hands on section. Otherwise feel free to skip to the next section.

Hands on - setup (Optional)


Installation & Setup

After finishing the steps in the tabs, this is how your directory should look like:

.
├── events.proto
├── events_pb2.py
└── script.py (or ipynb)

You can also choose to use juypter notebook instead of python script.

As of this writing, the latest protobuf version is 3.12.3.

We are now good to go!

For this guide, we will be using Anaconda

Prepare python environment

conda create -n stream python=3.7
conda activate stream
conda install protobuf==3.12.3 flask==1.1.2

If you are not using Anaconda, and prefer native python instead:

sudo pip install google protobuf
pip install flask

For installation guides, please refer to the google documentation2 or if you are in a hurry:

Using a mac:

brew install protobuf

If you prefer not to use brew,

PROTOC_ZIP=protoc-3.12.3-osx-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.12.3/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

Using linux,

PROTOC_ZIP=protoc-3.12.3-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v3.12.3/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
syntax = "proto2";
package tutorial;
message PaymentInfo {

  required int32 account_id = 1;
  optional float sales_amount = 2;

  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }

  optional payment_method method = 3;
}

In your current directory

INPUT_FILE=events.proto
SRC_DIR="$(pwd)"
DST_DIR=$SRC_DIR
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/$INPUT_FILE

You should see an events_pb2.py output.

Using Protobuf


import events_pb2 #created by above steps
payment = events_pb2.PaymentInfo()

payment.account_id = 1234
payment.sales_amount = 142.0
payment.method = 1 #CREDITCARD

print(payment)
"""
account_id: 1234
sales_amount: 142.0
method: CREDIT_CARD
"""
message = payment.SerializeToString()
print(message) #b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
type(message) #<class 'bytes'>

# Downstream user recieving the message:

payment_proto = events_pb2.PaymentInfo()
payment_proto.ParseFromString(message)
payment_proto #same output as above

from google.protobuf.json_format import MessageToJson

json_msg = MessageToJson(payment_proto)
print(json_msg)
"""
{
  "accountId": 1234,
  "salesAmount": 142.0,
  "method": "CREDIT_CARD"
}
"""

Your message is serialized to '\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'!

Features

Python Object

  • You can access the features as an object. For instance:

    payment_proto.account_id #1234
    payment_proto.method #1

Data Size

  • You can reduce the size of the message being sent

    # compare sizes
    import sys
    sys.getsizeof(message) #43
    sys.getsizeof(json_msg) #123

Wrong data type

  • What if you define data wrongly?

    payment = events_pb2.PaymentInfo()
    payment.account_id ="123"
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    TypeError: '123' has type str, but expected one of: int, long

    or miss out a field?

    payment = events_pb2.PaymentInfo()
    payment.sales_amount = 142.0
    payment.SerializeToString()
    Traceback (most recent call last):
      File "<input>", line 3, in <module>
    google.protobuf.message.EncodeError: Message tutorial.PaymentInfo is missing required fields: account_id

Changing Schema


Now, as your business grows, you start introducing a few more features, and would like to change your data model.

fields data type
account_id int
sales_total float
sales_currency enum
method enum
cart_id int

Notice that sales_amount has now been changed to sales_total. You will then define the new schema, and generate the python module again. For the purpose of a later demostration, we will keep the old python module and rename it to events_old_pb2.py.

Similarly, you may choose to follow the hands on or skip to the next step.

Hands on - Changing Schema (Optional)

Rename your old proto from events_pb2.py to events_old_pb2.py

mv events_pb2.py events_old_pb2.py

and the events.proto should look like:

syntax = "proto2";

package tutorial;

message PaymentInfo {

  required int32 account_id = 1;
  optional float total_sales = 2;

  enum currency {
    SGD = 0;
    USD = 1;
  }

  optional currency sales_currency = 4;

  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }

  optional payment_method method = 3;
  optional int32 cart_id = 5;

}

Notice the changes from sales_amount to total_sales

Forward compatibility


Let's use the same old message in the earlier example: b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'

Note: everytime you import a protobuf module, you need to restart the python kernel (or start a new one)

import events_pb2

new_parser = events_pb2.PaymentInfo()
old_message = b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'

new_parser.ParseFromString(old_message)
new_parser

"""
account_id: 1234
total_sales: 142.0
method: CREDIT_CARD
"""

the sales_amount automatically converts to total_sales!

Backward compatibility


Now, lets define a new event with the new schema (along with some new tricks).

import events_pb2
from google.protobuf.json_format import MessageToJson

new_parser = events_pb2.PaymentInfo()

new_parser.account_id = 123
new_parser.total_sales = 1000.0
new_parser.sales_currency = new_parser.currency.Value('USD')
new_parser.method = new_parser.payment_method.Value('CASH')
new_parser.cart_id = 1
print(new_parser)
"""
account_id: 123
total_sales: 1000.0
method: CASH
sales_currency: USD
cart_id: 1
"""
new_message = new_parser.SerializeToString()
print(new_message) #b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'

#Extra info
import sys
sys.getsizeof(new_message) #46
sys.getsizeof(MessageToJson(new_parser)) #156

Restart your python kernel, and using the old protobuf module,

import events_old_pb2

old_parser = events_old_pb2.PaymentInfo()
new_message=b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'

old_parser.ParseFromString(new_message)
print(old_parser)

"""
account_id: 123
sales_amount: 1000.0
method: CASH
"""


This way, up/down stream teams can have a centralize protobuf file to communicate, and systems are decoupled. The upstream services can make changes to the schema without fear of breaking downstream systems. Correspondingly, the downstream services can update their changes with the new protobuf file in their next release cycle.

Wonderful, isn't it?

Note: this does not come free, you need to obey some guidelines3.

Adding more message types


There are many other ways to use protobuf, please refer to the reference4.

Below is an example on adding a new message using repeated fields.

syntax = "proto2";

package tutorial;

message PaymentInfo {

  required int32 account_id = 1;
  optional float total_sales = 2;

  enum currency {
    SGD = 0;
    USD = 1;
  }

  optional currency sales_currency = 4;

  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }

  optional payment_method method = 3;
  optional int32 cart_id = 5;

}

message Cart {
  required int32 cart_id = 1;

  message Item {
    optional string item = 1;
    optional int32 quantity = 2;
    optional float amount = 3;
  }

  repeated Item Items = 2;
}
import events_pb2
from google.protobuf.json_format import MessageToJson

cart_event = events_pb2.Cart()
cart_event.cart_id = 5678

cart_items = cart_event.Items.add()
cart_items.amount = 40.0
cart_items.quantity = 3
cart_items.item = 'chicken'

cart_items = cart_event.Items.add()
cart_items.amount = 60.0
cart_items.quantity = 2
cart_items.item = 'beef'

data = MessageToJson(cart_event)
import json
json.loads(data)

Json output:

{
   "cartId":5678,
   "Items":[
      {
         "item":"chicken",
         "quantity":3,
         "amount":40.0
      },
      {
         "item":"beef",
         "quantity":2,
         "amount":60.0
      }
   ]
}

Neat, right?

Conclusion

Hopefully you have learnt abit on protobuf. You will find many articles / online posts advocating for protobuf, describing the benefits. As always, there are two sides to a coin, I leave you with the other side5 if you are interested.

Referrences: