Problem Intro
Typically, JSON is used to send data across systems. If you are in Python land, you probably keep things in a dictionary and serialize/deserialize it with json.dumps and json.loads respectively. If you are using a web framework such as Flask, you would use jsonify and the get_data() method.
For example,
import json
output = {'account_id': 1234, 'sales_amount': 40.0}
json_output = json.dumps(output)
type(json_output) #str
print(json_output)#{"account_id": 1234, "sales_amount": 40.0}
##### Using Flask ####
from flask import Flask
from flask import jsonify
app = Flask(__name__)
with app.app_context():
    json_output = jsonify(output)
    print(json_output) #<Response 40 bytes [200 OK]>
    print(json_output.get_data()) #b'{"account_id":1234,"sales_amount":40.0}\n'
    payload = json_output.get_data()  # avoid shadowing the built-in input()
print(json.loads(payload)) #{'account_id': 1234, 'sales_amount': 40.0}
Problem 1
Using JSON poses quite a few challenges. Neither the sending nor the receiving party has to enforce data types, which makes the data hard to parse further downstream. For instance, account_id can arrive as the string "1234", and sales_amount can arrive as the int 40 (sound familiar?).
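To make the type problem concrete, here is a minimal sketch using only the standard library. The two payloads are made up for illustration; they stand for two producers describing the same sale.

```python
import json

# Two hypothetical producers describe the same sale with different value types.
payload_a = '{"account_id": 1234, "sales_amount": 40.0}'
payload_b = '{"account_id": "1234", "sales_amount": 40}'

record_a = json.loads(payload_a)
record_b = json.loads(payload_b)

# json.loads accepts both; the mismatch only surfaces further downstream.
types_a = {key: type(value).__name__ for key, value in record_a.items()}
types_b = {key: type(value).__name__ for key, value in record_b.items()}

print(types_a)  # {'account_id': 'int', 'sales_amount': 'float'}
print(types_b)  # {'account_id': 'str', 'sales_amount': 'int'}
print(record_a == record_b)  # False
```

Nothing in the JSON format itself flags the second payload as wrong, so every consumer has to validate or coerce types on its own.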
Problem 2
Among us data folks, we usually reference dictionary values by key, e.g. data.get('sales_amount'). What if additional fields are required, or the schema renames sales_amount to total_sales? Downstream code such as data.get('sales_amount') would then return None instead of the value, and anything that depends on it would fail.
In production ML systems making predictions, changing the schema naming convention would yield different results. This results in release cycles getting more complicated and systems being more tightly coupled.
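A small sketch of the rename problem, assuming a hypothetical downstream helper called read_sales that still uses the old key. Both payloads are invented for illustration.

```python
import json

# Hypothetical upstream payloads before and after the field rename.
old_payload = json.loads('{"account_id": 1234, "sales_amount": 40.0}')
new_payload = json.loads('{"account_id": 1234, "total_sales": 40.0}')

def read_sales(data):
    """Hypothetical downstream reader that still looks up the old key."""
    return data.get('sales_amount')

print(read_sales(old_payload))  # 40.0
print(read_sales(new_payload))  # None
```

Because dict.get returns None rather than raising, the breakage is silent: the error shows up wherever the missing value is first used, not where the schema changed.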
Problem 3
Furthermore, in a typical data lake (which is very expensive to build and maintain), everything is usually dumped as JSON blobs with no consistency, resulting in significant technical debt and painful data cleaning. It is no wonder data scientists are said to spend more than 80% of their time cleaning data.
Ouch!
Introducing protobuf
Let's come up with a mock problem: we are trying to send and parse the following data:
fields | data type
account_id | int
sales_amount | float
method | enum
For those unfamiliar, an enum is like a factor variable: it can only take one of a fixed set of values. In this case, method can only be CASH, CREDIT_CARD, or WALLET.
You can read the Google guide on protobufs. Essentially, protobuf generates a Python module from a .proto file. In this example, we define events.proto and use protoc to generate events_pb2.py.
Note, if you are interested in following along, you may go through the hands on section. Otherwise feel free to skip to the next section.
Hands on - setup (Optional)
Installation & Setup
Using Protobuf
import events_pb2 #created by above steps
payment = events_pb2.PaymentInfo()
payment.account_id = 1234
payment.sales_amount = 142.0
payment.method = 1  # CREDIT_CARD
print(payment)
"""
account_id: 1234
sales_amount: 142.0
method: CREDIT_CARD
"""
message = payment.SerializeToString()
print(message) #b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
type(message) #<class 'bytes'>
# Downstream user receiving the message:
payment_proto = events_pb2.PaymentInfo()
payment_proto.ParseFromString(message)
payment_proto #same output as above
from google.protobuf.json_format import MessageToJson
json_msg = MessageToJson(payment_proto)
print(json_msg)
"""
{
"accountId": 1234,
"salesAmount": 142.0,
"method": "CREDIT_CARD"
}
"""
Your message is serialized to b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'!
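To see what those bytes contain, here is a rough hand-rolled decoder. It is a sketch, not the protobuf library: the helper names read_varint and decode_fields are made up, and it only handles the two wire types that appear in this message (varints and fixed 32-bit values).

```python
import struct

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7

def decode_fields(buf):
    """Return (field_number, value) pairs; varint and 32-bit fields only."""
    pos = 0
    fields = []
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint: int32, enum, ...
            value, pos = read_varint(buf, pos)
        elif wire_type == 5:  # fixed 32-bit, little-endian: float
            (value,) = struct.unpack('<f', buf[pos:pos + 4])
            pos += 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, value))
    return fields

message = b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
print(decode_fields(message))  # [(1, 1234), (2, 142.0), (3, 1)]
```

Each entry is a field number and a value. The field names (account_id, sales_amount, method) never appear in the bytes; they come from the schema on the reading side.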
Features
Python Object
Data Size
Wrong data type
What if you assign a value of the wrong type?
payment = events_pb2.PaymentInfo()
payment.account_id = "123"
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: '123' has type str, but expected one of: int, long
Or leave out a required field?
payment = events_pb2.PaymentInfo()
payment.sales_amount = 142.0
payment.SerializeToString()
Traceback (most recent call last):
  File "<input>", line 3, in <module>
google.protobuf.message.EncodeError: Message tutorial.PaymentInfo is missing required fields: account_id
Changing Schema
Now, as your business grows, you start introducing a few more features, and would like to change your data model.
fields | data type
account_id | int
total_sales | float
sales_currency | enum
method | enum
cart_id | int
Notice that sales_amount has now been renamed to total_sales. You then define the new schema and generate the Python module again. For a later demonstration, we will keep the old Python module and rename it to events_old_pb2.py.
Similarly, you may choose to follow the hands on or skip to the next step.
Hands on - Changing Schema (Optional)
Rename your old generated module from events_pb2.py to events_old_pb2.py:
mv events_pb2.py events_old_pb2.py
The updated events.proto should look like:
syntax = "proto2";
package tutorial;
message PaymentInfo {
  required int32 account_id = 1;
  optional float total_sales = 2;
  enum currency {
    SGD = 0;
    USD = 1;
  }
  optional currency sales_currency = 4;
  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }
  optional payment_method method = 3;
  optional int32 cart_id = 5;
}
Notice the change from sales_amount to total_sales.
Forward compatibility
Let's use the same old message in the earlier example: b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
Note: every time you switch to a different generated protobuf module, you need to restart the Python kernel (or start a new one).
import events_pb2
new_parser = events_pb2.PaymentInfo()
old_message = b'\x08\xd2\t\x15\x00\x00\x0eC\x18\x01'
new_parser.ParseFromString(old_message)
new_parser
"""
account_id: 1234
total_sales: 142.0
method: CREDIT_CARD
"""
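The old sales_amount value is read as total_sales! This works because both fields share field number 2 and the same type; the serialized message carries field numbers, not field names.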
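The rename can be pictured with two lookup tables. This is only an illustration: the new table is transcribed by hand from events.proto above, the old table is an assumption inferred from the printed output of the old module, and the (field number, value) pairs are what the old message carries on the wire.

```python
# Field-number -> name tables. OLD_SCHEMA is an assumption inferred from the
# earlier output; NEW_SCHEMA is transcribed from the events.proto shown above.
OLD_SCHEMA = {1: 'account_id', 2: 'sales_amount', 3: 'method'}
NEW_SCHEMA = {1: 'account_id', 2: 'total_sales', 3: 'method',
              4: 'sales_currency', 5: 'cart_id'}

# What the old message holds on the wire: (field number, value) pairs.
old_wire_fields = [(1, 1234), (2, 142.0), (3, 1)]

def label(wire_fields, schema):
    """Hypothetical helper: attach schema names to decoded wire fields."""
    return {schema[number]: value for number, value in wire_fields}

print(label(old_wire_fields, OLD_SCHEMA))
# {'account_id': 1234, 'sales_amount': 142.0, 'method': 1}
print(label(old_wire_fields, NEW_SCHEMA))
# {'account_id': 1234, 'total_sales': 142.0, 'method': 1}
```

The same bytes produce different names depending on which schema the reader holds, which is why a rename is safe as long as the field number and type stay the same.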
Backward compatibility
Now, let's define a new event with the new schema (along with some new tricks).
import events_pb2
from google.protobuf.json_format import MessageToJson
new_parser = events_pb2.PaymentInfo()
new_parser.account_id = 123
new_parser.total_sales = 1000.0
new_parser.sales_currency = new_parser.currency.Value('USD')
new_parser.method = new_parser.payment_method.Value('CASH')
new_parser.cart_id = 1
print(new_parser)
"""
account_id: 123
total_sales: 1000.0
method: CASH
sales_currency: USD
cart_id: 1
"""
new_message = new_parser.SerializeToString()
print(new_message) #b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'
#Extra info
import sys
sys.getsizeof(new_message) #46
sys.getsizeof(MessageToJson(new_parser)) #156
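Note that sys.getsizeof reports the size of the Python object, including interpreter overhead, not just the payload. A rough way to compare payload sizes is len(). The sketch below uses only the standard library; the JSON dictionary is written by hand, and its key names are an assumption based on the camelCase style that MessageToJson produced earlier.

```python
import json
import sys

new_message = b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'

# Hand-written JSON equivalent of the same event (key names are assumed).
json_message = json.dumps({
    "accountId": 123,
    "totalSales": 1000.0,
    "method": "CASH",
    "salesCurrency": "USD",
    "cartId": 1,
}, separators=(",", ":")).encode("utf-8")

print(len(new_message))   # 13
print(len(json_message))  # compact JSON payload size in bytes

# Interpreter overhead that getsizeof adds on top of the raw payload.
print(sys.getsizeof(new_message) - len(new_message))
```

Even with compact separators, the JSON payload is several times larger than the 13 protobuf bytes, mostly because JSON repeats the field names in every message.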
Restart your Python kernel and, using the old protobuf module, parse the new message:
import events_old_pb2
old_parser = events_old_pb2.PaymentInfo()
new_message=b'\x08{\x15\x00\x00zD\x18\x00 \x01(\x01'
old_parser.ParseFromString(new_message)
print(old_parser)
"""
account_id: 123
sales_amount: 1000.0
method: CASH
"""
This way, upstream and downstream teams can communicate through a centralized protobuf file, and systems are decoupled.
The upstream services can make changes to the schema without fear of breaking downstream systems. Correspondingly, the downstream services can update their changes with the new protobuf file in their next release cycle.
Wonderful, isn't it?
Note: this does not come for free; you need to obey some guidelines, such as never changing the field number of an existing field.
Adding more message types
There are many other ways to use protobuf, please refer to the reference.
Below is an example of adding a new message that uses repeated fields.
syntax = "proto2";
package tutorial;
message PaymentInfo {
  required int32 account_id = 1;
  optional float total_sales = 2;
  enum currency {
    SGD = 0;
    USD = 1;
  }
  optional currency sales_currency = 4;
  enum payment_method {
    CASH = 0;
    CREDIT_CARD = 1;
    WALLET = 2;
  }
  optional payment_method method = 3;
  optional int32 cart_id = 5;
}
message Cart {
  required int32 cart_id = 1;
  message Item {
    optional string item = 1;
    optional int32 quantity = 2;
    optional float amount = 3;
  }
  repeated Item Items = 2;
}
import events_pb2
from google.protobuf.json_format import MessageToJson
cart_event = events_pb2.Cart()
cart_event.cart_id = 5678
cart_items = cart_event.Items.add()
cart_items.amount = 40.0
cart_items.quantity = 3
cart_items.item = 'chicken'
cart_items = cart_event.Items.add()
cart_items.amount = 60.0
cart_items.quantity = 2
cart_items.item = 'beef'
data = MessageToJson(cart_event)
import json
json.loads(data)
JSON output:
{
  "cartId": 5678,
  "Items": [
    {
      "item": "chicken",
      "quantity": 3,
      "amount": 40.0
    },
    {
      "item": "beef",
      "quantity": 2,
      "amount": 60.0
    }
  ]
}
Neat, right?
Conclusion
Hopefully you have learnt a bit about protobuf. You will find many articles and online posts advocating for protobuf and describing its benefits. As always, there are two sides to a coin; I leave you with the other side if you are interested.
References: