DataRaccoon

Engineering Related


Introducing Protobuf for schema compatibility

  • Product team changed their schema without notifying you, which led to your pipelines/serving breaking!?
  • Tired of messy JSON data with no form or structure hitting your Kafka/PubSub stream?
  • Want to make sure data types and fields are consistent when you are receiving or sending API responses?

Typed Python to avoid ambiguity

  • Documentation is always an afterthought? Multiple variables and objects to keep track of?
  • Data scientist code is so hard to read and collaborate on! It is all over the place!
  • Having difficulty validating data from other parties or databases?
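
To give a flavour of the idea, here is a minimal sketch of typed Python using only the standard library. The `Transaction` class, its fields, and `total_amount` are hypothetical names made up for illustration; the post itself may well use other tools for validation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Transaction:
    """Hypothetical record; the type hints double as documentation."""

    user_id: int
    amount: float
    currency: str = "USD"

    def __post_init__(self) -> None:
        # Type hints are not enforced at runtime, so validate explicitly.
        if not isinstance(self.user_id, int):
            raise TypeError("user_id must be an int")
        if self.amount < 0:
            raise ValueError("amount must be non-negative")


def total_amount(transactions: List[Transaction]) -> float:
    """Sum the amounts; the signature says what goes in and what comes out."""
    return sum(t.amount for t in transactions)
```

A static checker such as mypy can then flag a call like `total_amount("oops")` before the code ever runs.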

Docker

Reproduce your environment with Docker!

  • Heard good things about Docker but not sure where to get started?
  • Docker tutorials out there not useful for a data scientist / analyst?
  • Developing within a Docker environment! (Remote IDE)

Multiple Docker containers running with Docker Compose

  • Introduction to Docker Compose!
  • Want to mock your environment with a database?
  • Running an interactive environment mimicking a production workflow?

(Py)Testing

Test your code!

  • What is the deal with testing? How do you even test data?
  • Testing is only for software engineers! I am a data scientist!
  • How should a data scientist view testing?

More on pytest - Marking, Configs & fixtures!

  • What are these conftest.py and pytest.ini files that I see in repositories?
  • How to share fixtures across multiple scripts?
  • (Slightly) advanced topics with fixtures on scoping and autouse.

Even more on pytest - Mocking!

  • Have external dependencies / requests? But tests should be isolated!
  • What about complicated functions that take a significantly long time to run?
  • How do I check my end to end flow is correct?
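
A rough sketch of the isolation idea, assuming nothing beyond the standard library's `unittest.mock`: the external call is passed in as an argument and replaced by a `Mock` in the test. The names `convert` and `fetch_rate` are hypothetical, and this uses plain dependency injection rather than `patch`, which is only one of several ways to do it.

```python
from unittest.mock import Mock


def convert(amount: float, currency: str, fetch_rate) -> float:
    """Code under test; fetch_rate stands in for a slow external request."""
    return amount * fetch_rate(currency)


def test_convert_with_mocked_rate() -> None:
    # The mock returns a canned value, so no network call is made.
    fake_fetch = Mock(return_value=2.0)
    assert convert(10.0, "SGD", fetch_rate=fake_fetch) == 20.0
    # The mock also records how it was called.
    fake_fetch.assert_called_once_with("SGD")
```

pytest would collect `test_convert_with_mocked_rate` automatically, and the function can also be called directly.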

Easy ETL with DBT!

  • Data Build Tool (DBT) by Fishtown Analytics!
  • Having difficulty managing SQL scripts? Too many nested queries? Or figuring out the execution graph? Finding yourself running scripts manually?
  • How do you even do testing for ETL jobs or queries!?!?

Write better Pandas code

  • Write better pandas code as a data scientist so that your engineer will love you! (Or so that your code is easier to productionize)
  • Frustrated by constant reassignment and/or hard referencing when creating new columns with pandas?
  • Like the Spark DataFrame API or R's dplyr API, or chaining / piping your code in general? (Some sort of functional programming)

Commands as code - Makefile!

  • Storing steps/procedures in README(s)? Keep copying and pasting commands to run?
  • Forgot your commands when revisiting projects from months ago?
  • Wish to hand over projects / commands / procedures to your teammates easily?

Introduction to Asyncio

  • Dealing with latency when trying to make multiple web/DB requests or building your API product?
  • Trying to understand what asyncio is?
  • Useful to understand if you are going to develop data products with FastAPI.
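
To illustrate the latency point, here is a minimal sketch in which three simulated requests run concurrently with `asyncio.gather`. `fake_request` is a hypothetical stand-in that just sleeps; a real web or DB call would need an async client library.

```python
import asyncio
import time
from typing import List


async def fake_request(name: str, delay: float) -> str:
    """Hypothetical I/O call; sleeping stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> List[str]:
    # gather schedules all three coroutines at once and keeps argument order.
    return await asyncio.gather(
        fake_request("a", 0.1),
        fake_request("b", 0.1),
        fake_request("c", 0.1),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(main())
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

Because the waits overlap, the total time should be close to the longest single delay rather than the sum of all three.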

Math/Stats


Introduction to metrics!

  • Getting started with metrics?
  • Having difficulty remembering metrics?
  • Sensitivity, power, recall, precision, ....
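
As a quick memory aid, here is a small sketch computing precision and recall from confusion-matrix counts. The counts are made up for illustration. Recall goes by several names: it is the same quantity as sensitivity, and it plays the role of power in hypothesis testing.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything predicted positive, how much was actually positive?"""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Of everything actually positive, how much did we catch?

    Also known as sensitivity or the true positive rate.
    """
    return tp / (tp + fn)


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=8))     # 0.5
```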

Quick Introduction to hypothesis testing

  • Data Science is not only about Machine Learning!
  • Interested to know more about A/B testing?
  • Understanding p-value, power, sample size.

(parametric) Distributions!

  • Quick overview of the common distributions!
  • Various use cases of each distribution.
  • How distributions are related to one another.

Quick recap on Regression

  • Quick overview of regression!
  • Formulation, derivations, assumptions.
  • Association to other concepts like logistic regression, ANOVA etc.

Machine Learning

Gini and Entropy in Decision trees!

  • Explaining Gini and Entropy
  • Explaining how they are used in decision trees!
  • Examples on continuous variables and more!
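
For reference, a minimal sketch of the two impurity measures using the usual textbook definitions, Gini as 1 minus the sum of squared class proportions and entropy as the negative sum of p * log2(p). The function names and sample labels are only illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


# A 50/50 split is the most impure two-class node.
print(gini(["a", "a", "b", "b"]))     # 0.5
print(entropy(["a", "a", "b", "b"]))  # 1.0
```

A decision tree picks the split that reduces one of these measures the most, with each child node weighted by its share of the samples.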

Often Asked


My current setup and tools

  • For those who are curious about what tools I use.
  • How I set up my machine.

Deciding what to learn!

  • My personal pitfalls, and how to avoid my mistakes.
  • How do I decide what to learn?
  • How do I stay motivated?

Useful resources I personally used

  • Resources I have used in the past
  • From ML/NN to programming languages, etc.
  • To be expanded!

Misc


Podcast with Symbolic Connection

  • An hour-long podcast on how I got started in my career.
  • How I made the switch from a "dashboard data scientist" to putting models in production.
  • And other questions such as Python or R?

Work in Progress

  • Statistics
    • Non parametric distributions
    • Statistical tests (NEXT)
    • Regression, ANOVA
    • Bayesian methods
      • Sampling algos
  • Machine learning
    • Decision trees (NEXT)
      • Gini, entropy, information gain
    • Neural Networks
    • Reinforcement learning
    • Vector embeddings
    • Kernel Tricks
    • Knowledge graphs
  • Software engineering
    • Data structures and algorithms
    • System design
    • Machine learning design
    • Trunk based workflow
    • Regex (NEXT)
  • Generic Python
    • OOP, decorator
    • Making a python egg/wheel
    • setuptools