Data Science & Machine Learning

Master Python, Statistics, and ML from zero to advanced with our comprehensive course.


Introduction to Data Science & Machine Learning

What You'll Learn

By the end of this course, you'll be able to:

  • Manipulate and clean real-world data (CSV, Excel, JSON, tabular databases)
  • Explore data with descriptive statistics and visualizations
  • Build predictive models (regression, classification, clustering)
  • Use scikit-learn to train, validate and evaluate pipelines
  • Understand neural network principles and train simple models with PyTorch/TensorFlow
  • Apply advanced methods (ensemble and boosting), work with time series and prepare models for deployment
  • Interpret results statistically and design experiments

Recommended Tools & Environment

For the best learning experience, we recommend:

  • Python: Use a recent, stable Python 3 release (3.11 or newer); before moving to the very latest version, check that the libraries below already support it
  • Key Libraries: NumPy, pandas, matplotlib, seaborn, scikit-learn (essential for classical ML)
  • Deep Learning: PyTorch and/or TensorFlow (choose one to start)
  • Boosting (Tabular): XGBoost and LightGBM are standard tools for high-performance tabular data

Practical Tip: Work in a virtual environment (venv/conda) and use Jupyter Notebook/JupyterLab or VS Code for interactive development.

Module 1 — Basics

Unit 1.1 — What is Data Science and Machine Learning

Objective: Understand the difference between Data Science (complete flow: collection → cleaning → analysis → deployment) and Machine Learning (models that learn from data).

Simple Explanation:

  • Data Science is a flow: understand problem → collect data → clean/format → explore (statistics/visualization) → model → communicate results
  • Machine Learning is the part that creates models that learn from data (e.g., predict prices, classify images)
  • There are two main ML categories: supervised (has labels) and unsupervised (no labels)

Practical Example (high level):

Problem: predict real estate prices.

Flow: collect ads → clean columns (area, bedrooms, neighborhood) → explore relationship between area and price → train regression → evaluate error → explain relevant variables.

Exercises (5)

  1. Define in your own words what Data Science is.
  2. Give 3 real examples where ML can be applied.
  3. Explain the difference between supervised and unsupervised learning.
  4. Give 2 reasons why data cleaning is important.
  5. List 5 steps in a Data Science project flow.

Unit 1.2 — Python Basics for Data Science

Objective: Learn basic Python concepts needed to manipulate data: types, lists, dictionaries, functions, packages and script/notebook execution.

Explanation:

  • Basic types: int, float, str, bool
  • Structures: list, tuple, dict, set
  • Control: if/else, for, while
  • Functions: def name(arg): ... return ...
  • Packages: import numpy as np, import pandas as pd

Practical Example (code):

# simple column sum with lists
areas = [50, 75, 100]
prices = [150000, 200000, 300000]
# calculate price per m2
ppms = [p/a for p,a in zip(prices, areas)]
print(ppms) # [3000.0, 2666.666..., 3000.0]
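
A second short sketch tying together the other constructs from the list above (dictionaries, functions, if/for); the listing values are made up for illustration:

# dictionary mapping listing id -> (area, price); values are illustrative
listings = {'a1': (50, 150000), 'a2': (75, 200000), 'a3': (100, 300000)}

def price_per_m2(area, price):
    if area <= 0:               # guard against bad data
        return None
    return price / area

for key, (area, price) in listings.items():
    print(key, price_per_m2(area, price))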

Exercises (5)

  1. Write a function that receives a list of numbers and returns the average.
  2. Make a for loop that prints only even numbers from 0 to 20.
  3. Create a dictionary that maps city names to their populations.
  4. Explain the difference between list and tuple.
  5. Install and import pandas; show the installed version (with command).

Unit 1.3 — Descriptive Statistics & Visualization

Objective: Understand central measures (mean, median, mode), dispersion (std, variance, IQR) and basic visualizations (histogram, boxplot, scatter).

Explanation:

  • Mean: arithmetic average — sensitive to outliers.
  • Median: middle value — robust to outliers.
  • Std / Variance: measure dispersion around the mean.
  • IQR: Q3 - Q1, useful for detecting outliers.

Practical Example (pandas + matplotlib):

import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('imoveis.csv') # columns: area, preco
print(df['preco'].mean(), df['preco'].median())
df['preco'].hist()
plt.title('Distribution of prices')
plt.show()
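
The IQR from the list above can be computed directly with pandas quantiles; a minimal sketch on the same (assumed) preco column:

q1 = df['preco'].quantile(0.25)
q3 = df['preco'].quantile(0.75)
iqr = q3 - q1
# common rule of thumb: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outlier candidates
outliers = df[(df['preco'] < q1 - 1.5 * iqr) | (df['preco'] > q3 + 1.5 * iqr)]
print(len(outliers), 'outlier candidates')
df.boxplot(column='preco')
plt.show()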

Exercises (5)

  1. Compute mean, median and standard deviation for a list of prices.
  2. Plot a histogram using matplotlib from a pandas Series.
  3. Explain the difference between boxplot and histogram.
  4. For a dataset with outliers, which central measure is more robust? Why?
  5. Interpret a scatter plot between area and price: what does a positive correlation indicate?

Unit 1.4 — Data Manipulation with pandas

Objective: Learn reading, selection, filtering, aggregation, joins and missing value handling with pandas.

Key operations:

  • pd.read_csv, df.head(), df.describe()
  • Selection: df['col'], df.loc[row, col], df.iloc[]
  • Filtering: df[df['preco'] > 100000]
  • Groupby: df.groupby('bairro')['preco'].mean()
  • Missing: df.isna(), df.dropna(), df.fillna()

Practical Example (join & aggregation):

imoveis = pd.read_csv('imoveis.csv') # id_imovel, bairro, area, preco
bairros = pd.read_csv('bairros.csv') # bairro, renda_media
df = imoveis.merge(bairros, on='bairro', how='left')
agg = df.groupby('bairro').agg({'preco':'mean','area':'median'}).reset_index()
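
The missing-value operations from the key operations list, sketched on the same (assumed) columns:

# count missing values per column
print(df.isna().sum())
# fill missing area with the column median, drop rows without a price
df['area'] = df['area'].fillna(df['area'].median())
df = df.dropna(subset=['preco'])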

Exercises (5)

  1. Read a CSV and show the first 10 rows.
  2. Filter rows where area is null and fill with the column median.
  3. Aggregate by neighborhood showing average price and count of properties.
  4. Join two DataFrames on a common key and explain inner/left/right/outer.
  5. Create a new column preco_m2 = preco / area and show the top 5.

Unit 1.5 — Linear Algebra & Calculus (intuition)

Objective: Present essential concepts of vectors, matrices, dot product and derivatives — enough to understand ML algorithms.

Explanation:

  • Vectors: lists of numbers; operations like addition and scalar multiplication.
  • Matrices: 2D arrays; matrix multiplication represents linear transformations.
  • Dot product: measures similarity.
  • Derivatives: rate of change — gradients used in optimization.

Practical Example (linear regression):

Linear regression: y = w0 + w1*x1 + w2*x2 — optimization (least squares) finds w that minimizes sum((y - ŷ)^2).
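
A minimal NumPy sketch of the ideas above (dot product, matrix multiplication and least squares); the numbers are arbitrary:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a @ b)                     # dot product: 1*4 + 2*5 + 3*6 = 32

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2x3 matrix
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # 3x2 matrix
print(A @ B)                     # 2x2 result of a linear transformation

# least squares: find w minimizing sum((y - X @ w)^2)
X = np.array([[1, 50], [1, 75], [1, 100]])   # column of ones + area
y = np.array([150000, 200000, 300000])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                         # [w0, w1]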

Exercises (5)

  1. Compute dot product of [1,2,3] and [4,5,6].
  2. Multiply a 2x3 matrix by a 3x2 matrix (numeric example).
  3. Show, step by step, how to compute the derivative of f(x) = x^2.
  4. Explain why gradient is used to minimize functions.
  5. Give an intuitive application of matrices in image transformations.

Module 2 — Intermediate

Unit 2.1 — Preprocessing & Feature Engineering

Objective: Learn techniques to prepare data for models: handling missing values, categorical encoding, scaling and creating new features.

Explanation:

  • Missing: simple imputation (mean/median), model-based imputation, NA flags.
  • Categorical: one-hot, label encoding, target encoding (use with care).
  • Scaling: StandardScaler, MinMaxScaler — important for scale-sensitive models.
  • Feature engineering: combine variables, extract date/time, group aggregations.

Practical Example (scikit-learn pipeline):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_cols = ['area','idade']
cat_cols = ['bairro','tipo']

num_pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                     ('scale', StandardScaler())])
cat_pipe = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                     ('ohe', OneHotEncoder(handle_unknown='ignore'))])

preproc = ColumnTransformer([('num', num_pipe, num_cols),
                             ('cat', cat_pipe, cat_cols)])
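
The feature-engineering bullet (date/time extraction and group aggregations) is not shown in the pipeline above; a small pandas sketch, where data_anuncio is a hypothetical listing-date column:

import pandas as pd

# extract date/time parts from a (hypothetical) listing date
df['data_anuncio'] = pd.to_datetime(df['data_anuncio'])
df['mes'] = df['data_anuncio'].dt.month
df['dia_semana'] = df['data_anuncio'].dt.dayofweek

# group aggregation as a feature: median area per neighbourhood
df['area_mediana_bairro'] = df.groupby('bairro')['area'].transform('median')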

Exercises (5)

  1. Apply SimpleImputer to numeric and categorical columns.
  2. Do one-hot encoding manually with pandas (pd.get_dummies).
  3. Create a feature idade_do_imovel from ano_construcao.
  4. Explain when to use MinMax vs StandardScaler.
  5. Build a pipeline that applies preprocessing and trains a RandomForestRegressor.

Unit 2.2 — Supervised Models (Regression & Classification)

Objective: Understand and apply linear regression, logistic regression, decision trees and k-NN.

Explanation:

  • Linear regression: predict continuous outcomes.
  • Logistic regression: estimate probabilities for classification.
  • Decision trees: interpretable rules-based models.
  • k-NN: classification by neighborhood similarity.

Practical Example (scikit-learn — linear regression):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = df[['area','idade_do_imovel']]
y = df['preco']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred, squared=False))

Exercises (5)

  1. Train a linear regression and compute RMSE.
  2. Train a logistic regression to classify churn; show confusion matrix.
  3. Explain overfitting and underfitting with visual examples.
  4. Train a decision tree and visualize it (export_graphviz).
  5. Compare k-NN with trees — when is each appropriate?

Unit 2.3 — Unsupervised Methods (Clustering, PCA)

Objective: Learn K-means, DBSCAN and PCA for dimensionality reduction and exploration.

Explanation:

  • K-means: partitions into k clusters; sensitive to scale.
  • DBSCAN: density-based clusters; detects noise.
  • PCA: orthogonal components that explain variance; useful for visualization.

Practical Example (K-means + PCA):

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df[['area','preco','renda_media']]
X_scaled = StandardScaler().fit_transform(X)   # K-means and PCA are scale-sensitive
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_pca)
# plot X_pca coloured by kmeans.labels_
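
Two quick diagnostics for the example above, as a minimal sketch: how much variance the two components keep, and a silhouette score for the clustering:

from sklearn.metrics import silhouette_score

print(pca.explained_variance_ratio_)            # share of variance kept per component
print(silhouette_score(X_pca, kmeans.labels_))  # closer to 1 = better separated clusters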

Exercises (5)

  1. Apply K-means on a dataset and interpret cluster centers.
  2. Use silhouette_score to assess number of clusters.
  3. Apply DBSCAN and compare with K-means.
  4. Perform PCA and show explained variance by first 2 components.
  5. Explain limitations of PCA.

Unit 2.4 — Evaluation, Validation & Model Selection

Objective: Learn metrics (RMSE, MAE, AUC, F1), cross-validation and hyperparameter tuning.

Explanation:

  • Validation: hold-out, k-fold CV
  • Metrics: choose according to problem (regression vs classification)
  • Tuning: GridSearchCV, RandomizedSearchCV

Practical Example (GridSearchCV):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {'n_estimators':[50,100], 'max_depth':[None,10,20]}
gs = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='neg_root_mean_squared_error')
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
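
For plain k-fold cross-validation without a grid search, cross_val_score works directly; a short sketch reusing the same regressor and training split:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

scores = cross_val_score(RandomForestRegressor(random_state=42), X_train, y_train,
                         cv=5, scoring='neg_root_mean_squared_error')
print(-scores.mean(), scores.std())   # mean RMSE across folds and its spread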

Exercises (5)

  1. Run 5-fold CV for a regression and compare with hold-out.
  2. Compute precision, recall and F1 for a classifier.
  3. Use GridSearchCV to search over C for LogisticRegression.
  4. Explain when to use AUC instead of accuracy.
  5. Interpret a ROC curve.

Unit 2.5 — Pipelines & Light Production with scikit-learn

Objective: Build reproducible pipelines, save models and apply transformations consistently.

Explanation:

  • Pipeline ensures same preprocessing in train and inference
  • Serialization: joblib.dump / load
  • Use ColumnTransformer and Pipeline to handle diverse columns

Practical Example (save & load):

from joblib import dump, load
pipeline = Pipeline([('preproc', preproc), ('model', RandomForestRegressor())])
pipeline.fit(X_train, y_train)
dump(pipeline, 'modelo.joblib')
# in production
model = load('modelo.joblib')
preds = model.predict(X_new)
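
Because the pipeline bundles preprocessing and model, inference code stays short; a minimal sketch of a scoring helper, assuming new data arrives as a CSV with the same raw columns used in training:

from joblib import load
import pandas as pd

def score_csv(path, model_path='modelo.joblib'):
    """Read raw rows from a CSV and return predictions from the saved pipeline."""
    pipeline = load(model_path)
    df_new = pd.read_csv(path)
    return pipeline.predict(df_new)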

Exercises (5)

  1. Build a full pipeline (preproc + model) and save with joblib.
  2. Explain why you should not fit StandardScaler on the full dataset before split.
  3. Show how to access the model via pipeline.named_steps['model'].
  4. Create a function that reads a CSV and returns predictions using a saved model.
  5. List best practices to prepare models for production.

Module 3 — Advanced

Unit 3.1 — Introduction to Deep Learning

Objective: Understand neurons, layers, loss, backpropagation and train a basic network with PyTorch or TensorFlow.

Explanation:

  • Perceptron: weighted sum + activation
  • Architectures: fully connected, convnets, RNN/LSTM for sequences
  • Training: forward → loss → backward → optimization (SGD, Adam)

Practical Example (PyTorch):

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleNet(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim,64),
            nn.ReLU(),
            nn.Linear(64,2)
        )
    def forward(self,x):
        return self.fc(x)

model = SimpleNet(in_dim=10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
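
The training bullet (forward → loss → backward → optimization) as a minimal loop, continuing the example above with synthetic data; batch size and epoch count are arbitrary:

# synthetic data: 256 samples, 10 features, 2 classes
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,))

for epoch in range(5):
    for i in range(0, len(X), 32):            # mini-batches of 32
        xb, yb = X[i:i+32], y[i:i+32]
        optimizer.zero_grad()                 # reset gradients
        loss = criterion(model(xb), yb)       # forward pass + loss
        loss.backward()                       # backpropagation
        optimizer.step()                      # parameter update
    print(f'epoch {epoch}: loss {loss.item():.4f}')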

Exercises (5)

  1. Explain ReLU vs Sigmoid activation.
  2. Implement a training loop (epochs, mini-batches) in PyTorch (pseudo-code is fine).
  3. What is overfitting in neural nets and how to reduce it? List 4 techniques.
  4. Train a simple network on a synthetic dataset.
  5. Compare PyTorch vs TensorFlow in 3 practical aspects.

Unit 3.2 — Advanced Methods: Ensembles & Boosting

Objective: Understand bagging and boosting and apply XGBoost/LightGBM for tabular data.

Explanation:

  • Bagging (RandomForest) reduces variance by training many trees on bootstrapped samples.
  • Boosting (XGBoost, LightGBM) sequentially corrects previous errors; often state-of-the-art for tabular data.
  • Tuning: learning_rate, n_estimators, max_depth, subsample, colsample_bytree.

Practical Example (XGBoost):

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.05, max_depth=6,
                         early_stopping_rounds=10)  # recent XGBoost expects early stopping on the estimator, not fit()
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)  # squared=False was removed in newer scikit-learn

Exercises (5)

  1. Train RandomForest and XGBoost and compare RMSE & runtime.
  2. Use feature_importances_ to interpret variables.
  3. Explain early_stopping_rounds and its benefits.
  4. Show how to enable GPU for XGBoost/LightGBM (reference docs).
  5. Discuss risks of target encoding and how to mitigate leakage.

Unit 3.3 — Time Series & Advanced Forecasting

Objective: Handle temporal data: decomposition, seasonality, ARIMA/SARIMA, Prophet and neural approaches (LSTM/Transformers).

Explanation:

  • Components: trend, seasonality, noise.
  • Classical models: AR, MA, ARIMA, SARIMA for linear temporal dependence.
  • Modern: Prophet for easy modeling; LSTM/Transformers for complex patterns.

Practical Example (decomposition):

from statsmodels.tsa.seasonal import seasonal_decompose
# series: a pandas Series with a DatetimeIndex (monthly data, hence period=12)
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
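
A classical ARIMA fit on the same (assumed monthly) series, forecasting 12 steps ahead; order=(1, 1, 1) is only a starting point to be tuned:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q)
res = model.fit()
print(res.summary())
forecast = res.forecast(steps=12)        # next 12 periods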

Exercises (5)

  1. Decompose a monthly series into trend/seasonality/noise.
  2. Fit an ARIMA model and interpret p,d,q parameters.
  3. When should you use regression-based models versus sequential models?
  4. Train Prophet on sales series and evaluate forecasts.
  5. Discuss temporal leakage and how to avoid it (time-based splits).

Unit 3.4 — Deploy, MLOps Basics & Monitoring

Objective: Learn ways to put models into production, version them, monitor them and keep performance stable.

Explanation:

  • Deploy: REST API (Flask/FastAPI), batch jobs, serverless
  • Containerize: Docker to isolate environment
  • MLOps: CI/CD, tests, drift monitoring, model registry (MLflow)
  • Monitoring: latency, accuracy, feature distributions

Practical Example (FastAPI + joblib):

from fastapi import FastAPI
from joblib import load
import pandas as pd

app = FastAPI()
model = load('modelo.joblib')

@app.post('/predict')
def predict(data: dict):
    df = pd.DataFrame([data])
    pred = model.predict(df)[0]
    return {'prediction': float(pred)}

Exercises (5)

  1. Create an API that receives attributes and returns a prediction.
  2. Sketch a Dockerfile for your app.
  3. Explain data drift and how to detect it.
  4. Introduce MLflow and describe how to log and track an experiment.
  5. Design a rollback plan if a production model degrades.

Unit 3.5 — Advanced Statistics: Tests, Bayesian Inference & Experiment Design

Objective: Learn hypothesis testing, p-value, confidence intervals, power analysis and basics of Bayesian inference.

Explanation:

  • Tests: t-test, chi-squared, ANOVA
  • P-value: probability of observing data at least as extreme under H0
  • CI: plausible interval for a parameter
  • Bayesian inference: combine prior + evidence = posterior
  • Experiment design: power, MDE, multiple testing corrections

Practical Example (t-test):

from scipy.stats import ttest_ind
stat, p = ttest_ind(group_a['conversao'], group_b['conversao'])
print('p-value:', p)
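
A 95% confidence interval for a mean with scipy, continuing with the same (assumed) group_a sample:

import numpy as np
from scipy import stats

x = group_a['conversao']
ci = stats.t.interval(0.95, len(x) - 1, loc=np.mean(x), scale=stats.sem(x))
print('95% CI:', ci)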

Exercises (5)

  1. Perform a t-test between two samples and interpret p-value = 0.03.
  2. Compute a 95% confidence interval for a mean.
  3. Explain difference between statistical significance and practical relevance.
  4. Give an example of a prior and how it affects posterior.
  5. Plan a simple A/B test (hypothesis, metric, sample size).

Conclusion, Study Plan & Portfolio Projects

Summary: You covered statistics, data manipulation, visualization, classical models, preprocessing, pipelines, deep learning and MLOps. Regular practice and real projects consolidate learning.

Advanced Study Plan (3–6 months)

  1. Month 1–2: practical projects (price prediction, classification). Strengthen statistics.
  2. Month 3: advanced ML — ensembles, tuning, intensive feature engineering.
  3. Month 4: deep learning practical — CNNs or NLP depending on interest.
  4. Month 5–6: deploy & MLOps — CI/CD pipelines, monitoring and automation.
  5. Continuous: participate in competitions (Kaggle), read papers, contribute to repos.

Portfolio Project Ideas

  • Project 1 — Real estate price prediction: collection, EDA, feature engineering, XGBoost, deploy API.
  • Project 2 — Churn classifier: feature table, explainability (SHAP), A/B test retention strategy.
  • Project 3 — Metrics dashboard: ETL pipeline + model + dashboard (Streamlit/Dash) showing predictions and drift monitor.
  • Project 4 — Simple recommender: item-item collaborative filtering.
  • Project 5 — Sales time series: decomposition, Prophet/ARIMA and LSTM, deploy daily forecasts.

Extras & Best Practices

  • Pin environment versions (requirements.txt / environment.yml).
  • Always split train/test before fitting any transformation that uses the data (especially target-based encodings), to avoid leakage.
  • Automate experiments (MLflow) and log hyperparameters.
  • Reproducibility: set seeds, document, version data & models.