
Problem Intro

As data scientists (DS), about 80% of our work is dealing with messy data. Our problems include, but are not limited to:

  • Database id being referenced as _id
  • Empty values can be referenced as NA, None, "None", "EMPTY"...
  • Data passed to you in production requests being wrong

As a DS working with other DS, or as an engineer working with DS, you will also run into:

  • Variable names being reassigned multiple times.
  • Variable naming conventions that are hard to track.
  • Code that is usually context heavy (e.g. why did the DS divide this number by that aggregation?)

Small Note: pylint/flake8 are also useful to address the above problems.

Pre-req / Setup!

Chop Chop! (Hurry up!)
Code has been prepared for you, just copy and run in your terminal!
You still need to check and/or create the appropriate directory though.

Assuming you are using anaconda distribution with mac/linux/docker etc,

mkdir typedpy
cd typedpy
cat << "EOF" > requirements.txt
# assumed package list: the two tools used in this article (pydantic v1 API)
mypy
pydantic<2
EOF
conda create -n typedpy python=3.8
conda activate typedpy
pip install -r requirements.txt

Hello World!

Introducing Typed Python! Here is a simple example using native python and the mypy package.

def add(x: int, y: int) -> int:
    """[Simple addition function]

    Args:
        x ([int]): [An integer]
        y ([int]): [An integer]
    """
    return x + y

Suppose a DS decides to use this function for another purpose in a python script,

add("hello, ", "how are you")
# output
# 'hello, how are you'

Running mypy on the script from your terminal, in the directory where the script lives, flags both arguments:

output:
error: Argument 1 to "add" has incompatible type "str"; expected "int"
error: Argument 2 to "add" has incompatible type "str"; expected "int"
Found 2 errors in 1 file (checked 1 source file)

cat << "EOF" | pbcopy
def add(x: int, y: int) -> int:
    """[Simple addition function]

    Args:
        x ([int]): [An integer]
        y ([int]): [An integer]
    """
    return x + y

add("hello, ", "how are you")
EOF
echo "$(pbpaste)" >

However, the downside is that mypy is only a static check: the code still runs, and nothing warns the user at runtime that they are doing something unintended!
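To make the point concrete, here is a minimal sketch showing that the Python interpreter stores type hints as metadata but does not enforce them when the function is called. It reuses the add function from above and assumes nothing beyond the standard library; the variable names hints and result are just for illustration.

```python
def add(x: int, y: int) -> int:
    """Simple addition function."""
    return x + y

# The hints are only stored as metadata on the function object.
hints = add.__annotations__

# Python does not check the hints at call time, so the two strings
# are silently concatenated instead of raising an error.
result = add("hello, ", "how are you")
print(result)
```

This is why a runtime validation layer is still useful on top of static checks.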


Introducing Pydantic!

Everything starts with a BaseModel, like so:

from pydantic import BaseModel

class InputNumbers(BaseModel):
    """
    This is where the doc string usually goes
    """
    a: int
    b: int

mynumbers = InputNumbers(a=10, b=100)

And you can define your function as follows:

def addition(input: InputNumbers) -> int:
    return input.a + input.b

input = InputNumbers(a=10,b=100)
# InputNumbers(a=10, b=100)

Or you can use dictionary inputs, which is useful when handling JSON requests:

input_dict = dict(a=11,b=101)
input2 = InputNumbers(**input_dict)
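As a sketch of the JSON request case, the body can be decoded with the standard json module and unpacked into the model. The payload string below is hypothetical, and InputNumbers is redefined with two integer fields so that the snippet is self-contained.

```python
import json
from pydantic import BaseModel

class InputNumbers(BaseModel):
    a: int
    b: int

# Hypothetical request body, as it might arrive from an upstream service.
payload = '{"a": 11, "b": 101}'

# json.loads gives a plain dict, which is unpacked into keyword arguments.
request = InputNumbers(**json.loads(payload))
total = request.a + request.b
print(total)
```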


Using a similar example, suppose the user tries to do string addition:

InputNumbers(a='I am so stupid',b=100)

ValidationError: 1 validation error for InputNumbers
a
  value is not a valid integer (type=type_error.integer)

Or the user forgets to input certain values:

InputNumbers(a=10) #b is missing

ValidationError: 1 validation error for InputNumbers
b
  field required (type=value_error.missing)

Warning! If Python allows the conversion, pydantic will not warn you. Do note that this behavior is intended!

For example, in Python it is acceptable to call str(1) or int("1").

class Example(BaseModel):
    a: int
    b: float
    c: int
    d: str

input_dict = dict(a=1.1, b=1.2, c='4', d=100)
Example(**input_dict)
# Example(a=1, b=1.2, c=4, d='100')
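If this silent coercion is not what you want, pydantic also exports strict types. The sketch below assumes the StrictInt and StrictStr types from pydantic; the StrictExample model name and the helper variables are made up for illustration.

```python
from pydantic import BaseModel, StrictInt, StrictStr, ValidationError

class StrictExample(BaseModel):
    c: StrictInt  # accepts only real ints, not "4"
    d: StrictStr  # accepts only real strings, not 100

# Values that already have the declared types are accepted.
ok = StrictExample(c=4, d="100")

try:
    StrictExample(c="4", d=100)
    rejected = False
except ValidationError as err:
    rejected = True
    # One error is reported per offending field.
    n_errors = len(err.errors())
```

Whether to coerce or reject is a design choice: coercion is convenient for loosely typed upstream data, while strict types surface upstream mistakes earlier.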


Because we are using Python classes and declaring types in the functions, an IDE can offer autocomplete while you develop the functions, speeding up your workflow!


You can also define outputs with pydantic:

from pydantic import BaseModel

class ExampleIn(BaseModel):
    a: int
    b: int

class ExampleOut(BaseModel):
    addition: int
    multiplication: int
    division: float

def compute_features(input: ExampleIn) -> ExampleOut:
    add: int = input.a + input.b
    multi: int = input.a * input.b
    div: float = input.a / input.b
    return ExampleOut(addition=add, multiplication=multi, division=div)

In = ExampleIn(a=10, b=100)
compute_features(In)
# ExampleOut(addition=110, multiplication=1000, division=0.1)


The full list of available types can be found in the docs. I will go through the ones I have used most often.

We will be making use of the typing library for certain cases. The reason will be explained further below.

Default Values

from pydantic import BaseModel
from typing import Optional

class Example(BaseModel):
    required: int #no value specified
    default_val: str = 10
    optional_val: Optional[int]

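# the default itself is not validated, so it stays the int 10
Example(required=1)
# Example(required=1, default_val=10, optional_val=None)

# a value that is passed in is validated and coerced to str
Example(required=2, default_val=10)
# Example(required=2, default_val='10', optional_val=None)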

Optional Values

from pydantic import BaseModel
from typing import Optional

class Example(BaseModel):
    required: int #no value specified
    default_val: str = 10
    optional_val: Optional[int]

Example(required=3,default_val=20,optional_val=100 )
# Example(required=3, default_val='20', optional_val=100)


from pydantic import BaseModel
from typing import Union

class Example(BaseModel):
    required: int #no value specified
    default_val: str = 10
    optional_val: Union[int,None]
    optional_val2: Union[int,str,float]

Aside: Optional is actually Union[..., None]
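The aside can be checked directly with the standard typing module. This is a small sketch; the variable names same and wider are illustrative.

```python
from typing import Optional, Union

# Optional[X] is shorthand for Union[X, None]; the two compare equal.
same = Optional[int] == Union[int, None]
print(same)

# Optional only adds None to the allowed types.
# Wrapping a Union in Optional gives the same Union with None added.
wider = Union[int, str, None]
print(wider == Optional[Union[int, str]])
```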

List, Dict, Any

  • What if you want to use certain python structures?
  • Unsure of what data type to use?
from typing import List, Dict, Any

# This will throw an error on Python 3.8 (built-in generics like list[float] need Python 3.9+)
var: list[float]

# this will not:
var: List[float]
var2: Dict[str, float]
var3: List[Any]

Enum / IntEnum

You use Enum generally when you want a variable to take in a set of categorical values.

from enum import Enum, IntEnum

class Animal(str,Enum):
    DOG: str = 'DOG'
    CAT: str = 'CAT'

class Action(IntEnum):
    JUMP = 1
    SIT = 2
    LIEDOWN = 3
    PAW = 4

You can use these classes as follows:
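A minimal sketch of typical usage, relying only on the standard enum module. The variable names pet, trick, is_dog and accepted are illustrative, and "BIRD" is a made-up invalid value.

```python
from enum import Enum, IntEnum

class Animal(str, Enum):
    DOG = "DOG"
    CAT = "CAT"

class Action(IntEnum):
    JUMP = 1
    SIT = 2
    LIEDOWN = 3
    PAW = 4

# Look a member up by its value, e.g. when parsing raw input.
pet = Animal("DOG")
trick = Action(2)

# Because Animal mixes in str, members compare equal to plain strings.
is_dog = pet == "DOG"

# A value outside the allowed set raises ValueError.
try:
    Animal("BIRD")
    accepted = True
except ValueError:
    accepted = False
```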


Complex Models

You can then define models/classes like this:

from typing import List, Dict, Set
from pydantic import BaseModel
from enum import Enum, IntEnum

class Animal(str, Enum):
    DOG: str = "DOG"
    CAT: str = "CAT"

class Action(IntEnum):
    JUMP = 1
    SIT = 2
    LIEDOWN = 3
    PAW = 4

class Pet(BaseModel):
    category: Animal
    tricks: List[Action]

class Attributes(BaseModel):
    age: int
    country: str

class House(BaseModel):
    Pets: List[Pet]
    attributes: Attributes

pet1 = Pet(category=Animal.DOG, tricks=[Action.JUMP, Action.SIT])
pet2 = Pet(category=Animal.CAT, tricks=[Action.LIEDOWN, Action.PAW])
House(Pets=[pet1, pet2], attributes=dict(age=10, country="Singapore"))

House(Pets=[Pet(category=<Animal.DOG: 'DOG'>, 
tricks=[<Action.JUMP: 1>, <Action.SIT: 2>]), 
Pet(category=<Animal.CAT: 'CAT'>, tricks=[<Action.LIEDOWN: 3>,
 <Action.PAW: 4>])], attributes=Attributes(age=10, country='Singapore'))
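Nested models can also be built straight from plain dictionaries, which is what a decoded JSON payload looks like. The sketch below is a trimmed-down version of the models above (fewer fields and members, to keep it short), and the raw dictionary is a hypothetical payload.

```python
from enum import Enum, IntEnum
from typing import List
from pydantic import BaseModel

class Animal(str, Enum):
    DOG = "DOG"
    CAT = "CAT"

class Action(IntEnum):
    JUMP = 1
    SIT = 2

class Pet(BaseModel):
    category: Animal
    tricks: List[Action]

class House(BaseModel):
    Pets: List[Pet]

# Hypothetical decoded JSON: only plain strings, ints, lists and dicts.
raw = {"Pets": [{"category": "DOG", "tricks": [1, 2]}]}

# pydantic converts the nested dicts and raw values into Pet, Animal and Action.
house = House(**raw)
first_pet = house.Pets[0]
```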


This section on validators largely follows the docs here, and the documentation is pretty good.

Instead, I will highlight some specific notes/details that tend to be overlooked.

In summary, this is what a typical validator looks like:

from pydantic import BaseModel, validator
from datetime import datetime
from time import time

class Account(BaseModel):
    account_id: int
    date_join: datetime

    @validator("date_join")
    def time_must_be_before_today(cls, v):
        # reject any join date later than the current time
        if v > datetime.now():
            raise ValueError("Are you from the future?")
        return v

Account(account_id=123, date_join=datetime(3000, 12, 1))

ValidationError: 1 validation error for Account
date_join
  Are you from the future? (type=value_error)

The way to understand the validator decorator is that it wraps a class method, and v represents the value of the attribute date_join specified above.

Also, inside the validator, you can choose to modify the value.


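from pydantic import BaseModel, validator

class Example(BaseModel):
    even_num: int

    @validator("even_num")
    def make_it_even(cls, v):
        # even numbers pass through, odd numbers are bumped up by one
        if v % 2 == 0:
            return v
        return v + 1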


Handling messy data

Now, suppose your upstream has messy data values. Rather than defining a separate cleaning function, you can just let pydantic do the job for you.

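from pydantic import BaseModel, validator

class CleanData(BaseModel):
    value: str

    @validator("value")
    def change_all(cls, v):
        # map the known placeholder strings to a single value
        if v in ["empty", "NA", "NONE", "EMPTY", "INVALID"]:
            v = "not supplied"
        return v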

This also allows for cleaner scripts and faster workflow. It also isolates the data cleaning in each step of the process.
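The same idea can be stretched to cover None and differently cased placeholders. This is only a sketch, assuming the pydantic v1-style validator decorator used in this article with its pre=True option; the EMPTY_MARKERS set and the normalise_empty name are hypothetical.

```python
from typing import Any
from pydantic import BaseModel, validator

# Hypothetical set of placeholders, compared case-insensitively.
EMPTY_MARKERS = {"", "na", "none", "empty", "invalid"}

class CleanData(BaseModel):
    value: str

    # pre=True runs this before pydantic's own str validation,
    # so a raw None can be handled as well.
    @validator("value", pre=True)
    def normalise_empty(cls, v: Any) -> Any:
        if v is None or str(v).strip().lower() in EMPTY_MARKERS:
            return "not supplied"
        return v

cleaned = [CleanData(value=x).value for x in [None, "NA", " none ", "42"]]
print(cleaned)
```

Keeping the placeholder list in one constant means a new upstream spelling only needs to be added in one place.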


Sometimes you are expected to return the data in JSON format, and certain Python data types are not supported natively.

For example:

import json

json.dumps(set([1, 2, 3]))
# TypeError: Object of type set is not JSON serializable

class SpecialSet(BaseModel):
    myset: set

example = SpecialSet(myset=set([1, 2, 3]))
example.json()
# '{"myset": [1, 2, 3]}'

If you are returning a dictionary, using the earlier example:

house = House(Pets=[pet1, pet2], attributes=dict(age=10, country="Singapore"))

house.dict()

{'Pets': [{'category': <Animal.DOG: 'DOG'>,
   'tricks': [<Action.JUMP: 1>, <Action.SIT: 2>]},
  {'category': <Animal.CAT: 'CAT'>,
   'tricks': [<Action.LIEDOWN: 3>, <Action.PAW: 4>]}],
 'attributes': {'age': 10, 'country': 'Singapore'}}

house.json()

'{"Pets": [{"category": "DOG", "tricks": [1, 2]}, {"category": "CAT", "tricks": [3, 4]}], "attributes": {"age": 10, "country": "Singapore"}}'

Note: full docs found here. It is worthwhile to take a look and understand the other methods available, specifically the exclude/include options.
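As a rough sketch of the exclude/include options, assuming the pydantic v1 .dict() export method used in this article. The Listing model and its fields are hypothetical and exist only to illustrate the two arguments.

```python
from pydantic import BaseModel

class Attributes(BaseModel):
    age: int
    country: str

class Listing(BaseModel):  # hypothetical model for illustration
    listing_id: int
    internal_note: str
    attributes: Attributes

item = Listing(
    listing_id=1,
    internal_note="do not expose",
    attributes=Attributes(age=10, country="Singapore"),
)

# exclude drops the named top-level fields from the export.
public = item.dict(exclude={"internal_note"})

# include keeps only the named fields.
only_id = item.dict(include={"listing_id"})
```

This is handy when the same object feeds both an internal log and an external response.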

Using Fields

Sometimes, your upstream / downstream:

  • references a schema with a different name,
  • or is prone to schema changes,
  • or has a different preference between CamelCase and snake_case.

This is where Field customisation becomes very useful.

Here are two examples:


from pydantic import BaseModel, Field

class Example(BaseModel):
    booking_id: int = Field(..., alias="_id", description="This is the booking_id")

example = Example(_id=123)
example.json()
# '{"booking_id": 123}'
example.json(by_alias=True)
# '{"_id": 123}'

By using an alias, you are able to have cleaner code, as your application code will be independent of the input/output names in your requirements docs.

Alias Generators

Suppose you prefer snake_case, but your upstream sends in CamelCase,

from pydantic import BaseModel

def to_camel(string: str) -> str:
    return ''.join(word.capitalize() for word in string.split('_'))

class Example(BaseModel):
    i_love_camel_case: str
    yes_i_really_do: str

    class Config:
        alias_generator = to_camel

eg = Example(ILoveCamelCase="TRUE", YesIReallyDo="YES, REALLY")

official docs here
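To see which aliases the generator above would produce, the to_camel function can be run on the field names by itself. This sketch needs only the standard library; the field_names list and aliases dictionary are illustrative.

```python
def to_camel(string: str) -> str:
    # Capitalise each underscore-separated word and join them together.
    return ''.join(word.capitalize() for word in string.split('_'))

field_names = ["i_love_camel_case", "yes_i_really_do"]

# Map each snake_case field name to the alias the upstream would send.
aliases = {name: to_camel(name) for name in field_names}
print(aliases)
```

Checking the generator in isolation like this is a cheap way to confirm the aliases match what the upstream actually sends.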


We have seen that with pydantic classes:

  • You can write application logic that is independent of your upstream/downstream by using aliases.
  • Values can be imputed or checked with validators.
  • Variables can also be adjusted within the pydantic class.
  • Data types are validated before proceeding.
  • Objects are clean with clear attributes, and functions are statically typed with no ambiguous inputs and outputs. This also makes testing easier.
  • Objects can be documented (versus typical code blocks, where documentation is usually an afterthought) with the help of class doc strings and Field descriptions.

