DataRaccoon
Engineering Related
Introducing Protobuf for schema compatibility
Product team changed their schema without notifying you, which led to your pipelines/serving breaking!?
Tired of messy JSON data with no form or structure hitting your Kafka/Pub/Sub stream?
Want to make sure data types and fields are consistent when you are receiving or sending API responses?
Typed Python to avoid ambiguity
Documentation is always an afterthought? Multiple variables and objects to keep track of?
Data scientist code is so hard to read and collaborate on! It is all over the place!
Having difficulty validating data from other parties or databases?
Docker
Reproduce your environment with Docker!
Heard good things about Docker but not sure where to get started?
Docker tutorials out there not useful for a data scientist / analyst?
Developing within a Docker environment! (Remote IDE)
Running multiple Docker containers with Docker Compose
Introduction to Docker Compose!
Want to mock your environment with a database?
Running an interactive environment mimicking a production workflow?
(Py)Testing
Test your code!
What is the deal with testing? How do you even test data?
Testing is only for software engineers! I am a data scientist!
How should a data scientist view testing?
More on pytest - Marking, Configs & fixtures!
What are the conftest.py and pytest.ini files that I see in repositories?
How to share fixtures across multiple scripts?
(Slightly) advanced topics with fixtures: scoping and autouse.
Even more on pytest - Mocking!
Have external dependencies / requests? But tests should be isolated!
What about complicated functions that take a significantly long time to run?
How do I check my end to end flow is correct?
Easy ETL with DBT!
Data Build Tool (dbt) by Fishtown Analytics!
Having difficulty managing SQL scripts? Too many nested queries? Or figuring out the execution graph? Finding yourself running scripts manually?
How do you even do testing for ETL jobs or queries!?!?
Write better Pandas code
Write better pandas code as a data scientist so that your engineer will love you! (Or make your code easier to productionize)
Frustrated by constant reassignment and/or hard referencing when creating new columns with pandas?
Like the Spark DataFrame API or R's dplyr API, or chaining / piping your code in general? (Some sort of functional programming)
Commands as code - Makefile!
Storing steps/procedures in README(s)? Keep copying and pasting commands to run?
Forgot your commands when revisiting projects from months ago?
Wish to hand over projects / commands / procedures to your teammates easily?
Introduction to Asyncio
Dealing with latency when trying to make multiple web/DB requests or building your API product?
Trying to understand what asyncio is?
Useful to understand if you are going to develop data products with FastAPI.
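To give a taste of the latency point above, here is a minimal sketch using only the standard library. The `fake_request` coroutine and its names are hypothetical; `asyncio.sleep` stands in for a real web or DB call, so treat this as an illustration rather than a production pattern.

```python
import asyncio
import time


async def fake_request(name, delay):
    # asyncio.sleep stands in for a network or DB call (hypothetical workload).
    await asyncio.sleep(delay)
    return f"{name} done"


async def fetch_all():
    # gather runs the three coroutines concurrently and returns results in input order.
    return await asyncio.gather(
        fake_request("users", 0.2),
        fake_request("orders", 0.2),
        fake_request("items", 0.2),
    )


def run_demo():
    # Time the whole batch: the waits overlap, so the total should be
    # close to one delay rather than the sum of all three.
    start = time.perf_counter()
    results = asyncio.run(fetch_all())
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Calling `run_demo()` should return the three results after roughly one delay, because the waits overlap instead of queuing up one after another.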
Math/Stats
Introduction to metrics!
Getting started with metrics?
Having difficulty remembering metrics?
Sensitivity, power, recall, precision, ....
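As a quick memory aid for the metrics listed above, here is a small sketch that derives them from confusion-matrix counts. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    return {
        # Of everything predicted positive, how much was actually positive?
        "precision": tp / (tp + fp),
        # Of everything actually positive, how much did we catch?
        # Recall is the same quantity as sensitivity (true positive rate).
        "recall": tp / (tp + fn),
        # Of everything actually negative, how much did we correctly reject?
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these made-up counts, precision is 40 / 50 and recall is 40 / 60, which shows how the two can diverge on the same predictions.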
Quick Introduction to hypothesis testing
Data Science is not only about Machine Learning!
Interested to know more about A/B testing?
Understanding p-value, power, sample size.
(parametric) Distributions!
Quick overview of the common distributions!
Various use cases of each distribution.
How distributions are related to one another.
Quick recap on Regression
Quick overview of regression!
Formulation, derivations, assumptions.
Association to other concepts like logistic regression, ANOVA, etc.
Machine Learning
Gini and Entropy in Decision trees!
Explaining Gini and Entropy
Explaining how they are used in decision trees!
Examples on continuous variables and more!
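For a rough idea of what the two impurity measures look like in code, here is a minimal sketch using the standard definitions (Gini as 1 minus the sum of squared class proportions, entropy in bits). The helper names are hypothetical and this is not taken from the linked post.

```python
import math
from collections import Counter


def class_probabilities(labels):
    # Proportion of each class among the labels.
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    # Gini impurity: 1 - sum(p_i^2). Zero for a pure node.
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    # Shannon entropy in bits: -sum(p_i * log2(p_i)). Zero for a pure node.
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def weighted_impurity(left, right, impurity):
    # Impurity after a split: size-weighted average of the two child nodes.
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)
```

A decision tree picks the split that lowers the weighted impurity the most. For a continuous feature, one common approach is to try candidate thresholds and compare `weighted_impurity` for each resulting left/right partition.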
Often Asked
My current setup and tools
For those who are curious about what tools I use.
How I set up my machine.
Deciding what to learn!
My personal pitfalls, and how to avoid my mistakes.
How do I decide what to learn?
How do I stay motivated?
Useful resources I personally used
Resources I have used in the past
Ranging from ML/NN to programming languages, etc.
To be expanded!
Misc
Podcast with Symbolic Connection
An hour-long podcast on how I got started in my career.
How I made the switch from a "dashboard data scientist" to putting models in production.
And other questions such as Python or R?
Work in Progress
Statistics
Non parametric distributions
Statistical tests (NEXT)
Regression, ANOVA
Bayesian methods
Sampling algos
Machine learning
Decision trees (NEXT)
Gini, entropy, information gain
Neural Networks
Reinforcement learning
Vector embeddings
Kernel Tricks
Knowledge graphs
Software engineering
Data structures and algorithms
System design
Machine learning design
Trunk based workflow
Regex (NEXT)
Generic Python
OOP, decorator
Making a python egg/wheel
setuptools