
Using Github Actions to deploy an Astro app to S3

If you have trouble viewing the images in this blog post, turn on light mode in the upper right corner of the page.

This blog post, along with this site, are hosted on an AWS S3 bucket. But how do I develop a new blog post, test it out locally, and then deploy those changes to my S3 bucket without manually uploading each file?

No need to wonder. I'll show you!

1. Create a new user with proper permissions in AWS IAM

The first step is to create a user that the Github Actions virtual machine will use programmatically. I wouldn't recommend recycling an existing user for this purpose, especially if that user is attached to other services. If something were to go wrong, it would be easy to revoke access from only the Github Actions user without interfering with any of your other services.

Log into your AWS console and go to the AWS IAM user interface to create a new user.

Once you've decided on a cool name, attach the AmazonS3FullAccess policy to this user.

Create user in IAM

Next, click on Create access key and work through the dialog box options. Afterwards, copy your access key and secret key.

Create access key

2. Add AWS access and secret key to your Github repository

After generating your keys, navigate to your Github repository and add your secrets to your repository (Settings > Security > Secrets and variables > Actions).

I named my access key AWS_ACCESS_KEY_ID and my secret key AWS_SECRET_ACCESS_KEY. You can name yours whatever you'd like; just keep the names consistent when you reference them in your Actions workflow YAML file later.

Add keys to repo

3. Create a new Actions workflow using YAML file

Caution! The Post-deployment step below is optional, but was required for my use case. This piece of code will make all of the objects within the S3 bucket public. For my use case, I'm okay with that, but be careful using this code as it will make anything in your bucket public. Please do not use this if you have any sensitive data in your bucket!

To create a Github Actions workflow, add a new YAML file in ./.github/workflows in your repository. I named my file main.yml but you can call it whatever you want.

Here's the complete .yml file:

name: Deploy website
on:
  push:
    branches:
      - master

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4

      - name: Cache 
        id: cache
        uses: actions/cache@v4
        with:
          path: |
            public
            node_modules
          key: ${{ runner.os }}-cache

      - name: Install Dependencies 
        run: npm install astro

      - name: Run Build 
        run: npm run build

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy static site to S3 bucket
        run: aws s3 sync ./dist s3://${{ secrets.S3_STRING }} --delete

      - name: Post-deployment 
        run: |
          output=$(aws s3api list-objects-v2 --bucket ${{ secrets.S3_STRING }} --query "Contents[].Key" --output text)

          for key in $output; do
            aws s3api put-object-acl --bucket ${{ secrets.S3_STRING }} --key $key --acl public-read
          done

The on parameter ensures this Action only runs when there is a push to the master branch.

The steps below all relate to downloading files onto the virtual machine, caching for efficiency, and creating the build for the deployment:
- Checkout
- Setup Node
- Cache
- Install Dependencies
- Run Build

The Configure AWS Credentials step uses an AWS verified Github Action to authenticate the Github Actions virtual machine with AWS using the previously created user. It's important to use the same names that you used when you added the secrets to the repository secrets in step 2.

The Deploy static site to S3 bucket step deploys the files from the dist folder on the virtual machine to the bucket. Remember that dist is the build output created by npm run build. The --delete flag deletes any leftover files in the bucket that do not exist in the source directory.

The S3_STRING secret is nothing but the name of your bucket. I parameterized it, but you do not have to if it's not a concern for you and your specific use case.

4. Test your new Github Actions pipeline

The YAML file triggers the pipeline only when a new push occurs to the master branch. However, in many enterprise scenarios, you have to go through a pull request review process before merging and integrating your changes into master, so you can remove the trigger or specify a different one if you'd like to test your new Github Actions workflow.
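For example, Github Actions supports a workflow_dispatch trigger that adds a manual "Run workflow" button to the Actions tab, which is handy for testing without pushing to master. A sketch of what the on block might look like with both triggers:

```yaml
on:
  push:
    branches:
      - master
  # allows the workflow to be started manually from the Actions tab
  workflow_dispatch:
```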

If successful, you should see something like the following in the Actions tab of your repository:

Actions screenshot

And, for a more detailed look into the workflow, you can click on that specific run:

Actions screenshot detailed

5. Enjoy the hours saved by going outside or doing something fun

Hope you enjoyed it! Feel free to reach out to me on any of my links on the home page if you run into any trouble following my guide.

Why REST APIs are not ideal for analytics

Have you ever thought that too much of a good thing can be a bad thing?

I have this thought a lot -- with chocolate especially, but I rarely ever have this thought with data. In today's world, the more data, the better. But recently I've been doing a lot with social media APIs, and most of these APIs are surprisingly difficult to use for analytical purposes.

If you're not familiar with APIs, REST is the most prevalent style of web API today. Gone are the days of XML-based protocols like SOAP; REST is now king.

Now, I'm no expert in APIs nor an API developer, so forgive my rudimentary explanation, but generally REST APIs work like this: they accept different methods to interact with the API, most commonly GET and POST. You make these GET or POST requests to predefined endpoints, sometimes passing parameters like an offset or limit. Endpoints will almost always return a predefined response, too.

Take this made-up example of two endpoints: one for customers that always returns the customer's name, email, and address, and one for orders that always returns order details, product details, and the customer who placed the order.

GET /api/v2/users

Response:
{
    "customerId": 9,
    "firstName": "John",
    "lastName": "Doe",
    "email": "john.doe@me.com",
    "address": "123 Neighbor Rd, Made-up Land, USA 12345"
}
GET /api/v2/orders

Response:
{
    "orderId": 1,
    "customerId": 9,
    "product": "Chamois Butt'r",
    "productId": 1
}

What if you wanted to only get the customer's email and the product ID of a specific order without having to make two separate API calls and receiving / requesting irrelevant bytes of data? Well, you can't really do that with most REST APIs. You would have to make both calls to two separate endpoints and parse the data that you wanted while discarding the rest (get it?).

This is problematic when you are trying to use an API designed for transactions for analytical purposes, and even more so with API limits and quotas. Imagine you have to pull 3 million customer records every day, but one call to the endpoint returns only 10,000 customers at a time, so you have to paginate through all the records. That means you need at a minimum 300 API calls (3 million / 10,000 customers) to get the entire batch, and even when you do get to the end of the pagination, you discard 90% of the data from each request! Too much waste, if you ask me.
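To put numbers on that, here's a quick back-of-the-envelope sketch; the page size, total record count, and useful fraction are the hypothetical figures from above:

```python
import math

total_customers = 3_000_000   # records we need every day
page_size = 10_000            # records returned per paginated API call
useful_fraction = 0.10        # we only keep ~10% of each response

# minimum number of calls to page through the whole batch
calls = math.ceil(total_customers / page_size)

# rough share of transferred data we immediately discard
wasted_fraction = 1 - useful_fraction

print(calls)            # 300 calls for a single daily batch
print(wasted_fraction)  # ~0.9 of every response is thrown away
```

Three hundred calls per day adds up fast against a rate limit, especially when most of each response goes straight in the bin.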

GraphQL, on the other hand, allows you to very specifically state what you are looking for against the database without having to request from only one endpoint at a time. So, if we wanted to grab the customer's email and the product ID within an order, we could do so without having to make two separate calls to each endpoint. Here's what the request might look like:

GraphQL Query

{
  orders {
    id
    products {
        id
    }
    customers {
      email
    }
  }
}
Response:
{
    "data":
    {
      "orders": {
        "id": 1
      },
      "products": {
        "id": 1
      },
      "customers": {
        "email": "john.doe@me.com"
      }
    }
}

Much more efficient! You get all the data you need, without any of the extra fluff, in fewer API calls. You may need slightly more involved parsing, but it will definitely save you tons of traversing through multiple endpoints and multiple pages to get the data you need.
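For illustration, here's a sketch of how that query might be sent from Python. The endpoint URL is made up; the general shape, though, is that GraphQL servers accept a POST whose JSON body contains a query field:

```python
# the same fields as the example query above
query = """
{
  orders {
    id
    products {
        id
    }
    customers {
      email
    }
  }
}
"""

# a GraphQL request is typically a single POST with a JSON body like this
payload = {"query": query}

# with the requests library (hypothetical endpoint):
# import requests
# response = requests.post("https://example.com/graphql", json=payload)
# data = response.json()["data"]

print(sorted(payload.keys()))  # ['query']
```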

Now, what can be done about getting a social media company to develop a GraphQL API rather than a REST API? Little to nothing. But if every API were GraphQL-like, the world would be a much better place... and you would have to write fewer requests.

Using dictionaries to call functions in a metadata driven framework

Consider this: you've been tasked at work to come up with a solution for dynamically calling functions depending on different criteria in a metadata table. Let's say this table contains an object name, a "rule" that maps to a specific function, and any parameters for that function.

How would you go about solving this problem? Here's how I did it:

Here's an example of the aforementioned metadata table:

| object_name | rule | parameters |
| --- | --- | --- |
| dim_customer | create_composite_key | {'key1': 'foo', 'key2': 'bar'} |
| dim_customer | clean_email_address | {'column1': 'baz', 'audit': True} |
| dim_products | create_composite_key | {'key1': 'qux', 'key2': 'corge'} |

Now, here are examples of the functions that need to be called for each entry in the table above:

from pyspark.sql.functions import col, concat, lit, trim

def gen_comp_key(df, key1, key2):
    df = df.withColumn('composite_key', concat(col(key1), lit("-"), col(key2)))
    return df

def clean_email(df, column1, audit):
    if audit:
        df = df.withColumn('email_address_clean', trim(col(column1)))
    # return unconditionally so non-audited dataframes pass through unchanged
    return df

Now that we have these functions defined, we need to create a dictionary that houses them, mapping the metadata "rule" values to the functions. Notice we store the function references (without parentheses and parameters), not the function executions.

dispatcher = {
    'create_composite_key': gen_comp_key,
    'clean_email_address': clean_email
}

Notice how we have stored the functions as the values and the "rule" column in the metadata table as the key. Now, let's call the functions dynamically based on the contents of the table above:

rows = [
    ('dim_customer', 'create_composite_key', {'key1': 'foo', 'key2': 'bar'}),
    ('dim_customer', 'clean_email_address', {'column1': 'baz', 'audit': True}),
    ('dim_products', 'create_composite_key', {'key1': 'qux', 'key2': 'corge'})
]

for obj, rule, params in rows:
    df = dispatcher[rule](df=df, **params)

In the above, the dispatcher is our dictionary containing our functions. For each row of the table, we look up the function by its rule name and pass in the parameters for that data object, all without explicitly naming the functions in our code!
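To see the whole pattern end to end without Spark, here's a minimal, self-contained sketch. The function bodies are simplified stand-ins for the PySpark versions above, operating on a plain dict instead of a dataframe:

```python
# stand-in "transformations" mimicking the PySpark functions above
def gen_comp_key(df, key1, key2):
    df['composite_key'] = f"{df[key1]}-{df[key2]}"
    return df

def clean_email(df, column1, audit):
    if audit:
        df['email_address_clean'] = df[column1].strip()
    return df

# rule name -> function reference (no parentheses: we store the
# function itself, not the result of calling it)
dispatcher = {
    'create_composite_key': gen_comp_key,
    'clean_email_address': clean_email,
}

# rows of a miniature metadata table
rows = [
    ('dim_customer', 'create_composite_key', {'key1': 'first', 'key2': 'last'}),
    ('dim_customer', 'clean_email_address', {'column1': 'email', 'audit': True}),
]

df = {'first': 'John', 'last': 'Doe', 'email': ' john.doe@me.com '}
for obj, rule, params in rows:
    df = dispatcher[rule](df=df, **params)

print(df['composite_key'])        # John-Doe
print(df['email_address_clean'])  # john.doe@me.com
```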

How I built a micro data lake on S3 using AWS, DuckDB, and Streamlit

If you have trouble viewing the images in this blog post, turn on light mode in the upper right corner of the page.

For my 24th birthday earlier this year, my girlfriend gifted me a BirdBuddy. I instantly fell in love, going to my local bird store and buying bird seed, a pole, and the adapter for it. The really cool bit of all of this is that the BirdBuddy app produces a lot of data about your feeder and the birds it captures. So, naturally, I started to think about all the interesting things I could do with this data: I could make my own data lake, I could create visualizations and graphs, or I could maybe even deploy my own bird watching companion app with all of the above!

Now don't get me wrong, the BirdBuddy app is amazing, intuitive, and really user friendly, but the way it is set up is kind of wonky. The app captures a sighting of a bird as an event, but doesn't always save this event so that you can store it forever. "Sightings" are ephemeral and disappear after a few days if not saved into what BirdBuddy calls your "Collection". That means if you don't save all of your Sightings into your Collection as they occur, chances are you'll lose that data. For obvious reasons, I wanted to store my sightings data in a permanent solution, which meant extracting that data from the BirdBuddy API into my own data environment and then visualizing it in an application. That's where the fun part starts.

BirdBuddy API and pybirdbuddy

The BirdBuddy app uses a GraphQL API to pass data around, which means, we can use the same API to make GraphQL calls to extract the data. Huge kudos and thanks to jhansche on Github for pybirdbuddy, which provided a lot of code that made authenticating and interacting with the BirdBuddy API simple and easy. You can read more about the Python library in the github repository.

My goal through using the API was to extract all sightings data and store it permanently so it wouldn't be deleted, and to have a raw data layer in S3 full of bird sighting data and images. Here's an example of what the response from the API for a sighting would look like:

"sightingReport": {
"reportToken": "string",
"sightings": [
    {
    "id": "65b43d02-e159-4ef1-a61a-b37a37b29ecd",
    "matchTokens": [
        "string"
    ],
    "__typename": "SightingRecognizedBirdUnlocked",
    "color": "PINK",
    "text": "Knock, knock! Looks like you've got a new first-time visitor!",
    "shareableMatchTokens": [
        "string"
    ],
    "species": {
        "id": "string",
        "iconUrl": "https://assets.cms-api-graphql.cms-api.prod.aws.mybirdbuddy.com/asset/icon/bird/a80b2a00-9a49-46ed-b5f1-5e7bde4b8267_house%20finch.svg",
        "name": "House Finch",
        "__typename": "SpeciesBird",
        "isUnofficialName": false,
        "mapUrl": "https://assets.cms-api-graphql.cms-api.prod.aws.mybirdbuddy.com/asset/map/bird/d8c20684-d366-4933-8ceb-4955ec9459a1_House%20Finch.svg"
    }
    }
],
"__typename": "SightingReport"
}
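As a quick illustration, pulling the species names out of a report like this is a small bit of dict traversal; the report dict below is trimmed down from the example response above:

```python
# trimmed-down version of the example sightingReport above
report = {
    "reportToken": "string",
    "sightings": [
        {
            "id": "65b43d02-e159-4ef1-a61a-b37a37b29ecd",
            "__typename": "SightingRecognizedBirdUnlocked",
            "species": {
                "name": "House Finch",
                "__typename": "SpeciesBird",
            },
        }
    ],
}

# collect every recognized species name in the report; not every
# sighting type carries a species key, so guard for it
names = [s["species"]["name"] for s in report["sightings"] if "species" in s]
print(names)  # ['House Finch']
```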

Once sighting data was received using the pybirdbuddy library, you would then have to post it in a Collection request in order to add it to your collections in the app. For my use case, though, I only cared about the sighting report, as it's the most granular form of the data. Posting a sighting into your collection returns less data than the sighting report provides, so I didn't see a need to capture both sighting and collection data.

AWS Extract, Load, and Transform Architecture

If you've worked in the cloud before, you know how easy it is to deploy services. The tricky part is getting the configuration and security right, so I won't talk about that here, but 80 to 90 percent of my debugging time usually had to do with permissions, configuration, and security in AWS.

Here's a simplified visual of my AWS ELT architecture for this project: Back end architecture

Not shown in the visual, for simplicity's sake: I actually have two S3 buckets, one to store the data in its raw form (JSON) and one that holds the transformed data in parquet format. I use the transformed parquet data to feed into DuckDB to do some pretty powerful analytics on a very tiny budget (free!). Both S3 buckets are fed by the exact same services in the diagram, just with slightly different code. The pipeline in the diagram is scheduled to kick off once a day as a batch process.

AWS Lambda and Elastic Container Registry

I used AWS Lambda and AWS Elastic Container Registry (ECR) as the compute for this project. I had to deploy my Lambdas as Docker images through ECR because the Python libraries the extraction depended on had grown past the size limit for standard Lambda layers; container images deployed via ECR have a much higher limit. This works out well, since you can containerize your Lambda code, standardize the process, and test the code a bit more efficiently. ECR is also easy to use and has a free tier component.

AWS Simple Queue Service

I used AWS Simple Queue Service (SQS) to hold data in transit from Lambda to Lambda. This allowed me to decouple my compute process and also set up some dead letter queues if something were to go wrong. Because my extraction pipeline is relatively simple (only powered by 2 lambdas), I wanted to decouple them so I could run the lambdas asynchronously and hold the data in a temporary storage solution before transforming it further and dumping it into S3. This works really well if you have a varying number of producers and subscribers of data, but in my case for a batch process, it might be overkill.

AWS S3

If you haven't heard of S3 already, I'd be very surprised. It's a really awesome storage tool that can store pretty much anything. I use S3 to store both JSON data and images from my BirdBuddy. As mentioned earlier, I have two S3 buckets: one to store my raw data (both JSON and images), and one to store my clean data for analytics in parquet format. Not quite medallion architecture since I skipped over the silver layer, but for my simple use case, it works out perfectly and helps keep everything within the free tier!

Front-end Architecture

For the front end of the application, I used Streamlit in a Docker container hosted on AWS EC2. Here's a simplified visual of what that looks like for my application: Front end architecture

It's hard to properly visualize the front end, since Streamlit runs within Docker rather than as a sequential step. Also, DuckDB is used within Streamlit to run queries on the parquet files in S3, so it's technically not its own service, but to make sure I'm on board the hype train, I included all of that in the title. 😁

DuckDB

I use DuckDB as my OLAP tool instead of a conventional database/data warehouse. Basically, I use Pandas to append raw data into my parquet files as new data comes in, and then use DuckDB to query the data.

Here's a snippet of code I took from my transformation (eLt) lambda to give you an idea of how it's done:

# assumes the lambda has already set up: a boto3 client `s3`, a dataframe
# `df`, and the raw file name `m`; imports: json, logging, pandas as pd,
# and `from datetime import datetime as dt`
data = s3.get_object(Bucket='raw-bucket', Key='sightings/'+f'{m}.json')
content = json.loads(data.get("Body").read())

# pull the timestamp off the sighting's media records
for medias in content['medias']:
    ts = dt.strptime(medias['createdAt'], "%Y-%m-%dT%H:%M:%S.%fZ").strftime('%Y-%m-%d %H:%M:%S.%f')

# skip unrecognized-bird reports, then append one row per sighting
if content['sightingReport']['sightings'][0]['__typename'].lower() != 'sightingcantdecidewhichbird':
    for sighting in content['sightingReport']['sightings']:
        new_row = {"name": sighting['species']['name'], "text": sighting['text'], "activityDate": ts, "source": f'{m}.json'}
        logging.info(f'Bird data: {new_row}')
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    df.to_parquet(path='s3://reporting-bucket')

And here's a snippet of code of how I use DuckDB once the data is transformed and dumped into my parquet files within the Streamlit application:

import duckdb
import pandas as pd

df = pd.read_parquet('s3://reporting-bucket')
conn = duckdb.connect(database=':memory:', read_only=False)
conn.execute("CREATE TABLE sightings AS SELECT * FROM df")
count = conn.execute("""SELECT name, MAX(activityDate) AS lastVisited, COUNT(*) AS count FROM sightings GROUP BY name ORDER BY lastVisited DESC""").df()

Streamlit's visuals revolve a lot around Pandas dataframes, so using DuckDB to write SQL queries that produce Pandas dataframes is way easier and faster (in my opinion) than using the Pandas dataframe API to achieve the same. Using DuckDB also saves me from provisioning and maintaining a database in the cloud; all of my data is parquet based, making reads very fast and efficient.

Creating simple DAGs with Networkx

This blog post expects that the reader has some experience with Python.

Any time a data engineer hears the word "visualization" or "dashboard", fear instantly sets in. However, sometimes a basic visualization or chart is the best way to convey or explain an idea.

Imagine trying to explain a highly complex data pipeline, in which there are many dependencies and steps, without the help of visual aids. I'm sure all of us would agree that it is easier to explain and to understand something with a picture of it in front of you.

Thus, we can use networkx and matplotlib in Python to create some pretty cool dependency maps and DAGs for pipelines. If you're unfamiliar with DAGs, read about them here.

I have embedded all relevant snippets of code to this blog post, but if you'd prefer the Github repository, feel free to clone it here.

Dependency maps

First off, you'll want to make sure you are using at least Python 3 and install the following packages using pip or requirements.txt in the repository:

pip install matplotlib
pip install networkx
pip install requests

Once you've done that, let's go ahead and create some dummy data using a wordbank and the requests library:

import networkx as nx
import matplotlib.pyplot as plt
import random
import requests

# get random word bank
word_site = "https://www.mit.edu/~ecprice/wordlist.10000"

response = requests.get(word_site)
words = response.content.splitlines()

# convert byte into string
words = [word.decode() for word in words]

After sending out your request, you'll receive a list full of 10,000 words that you can choose from. Below is an example of the output of your words object:

>>> words
['a', 'aa', 'aaa', 'aaron', 'ab', ...]

We will be using the random library to randomly choose some words to create a data structure, simulating what some dependencies would look like if they were to be materialized as a data object in Python:

# append .csv for file-like names
files = [random.choice(words)+'.csv' for i in range(0,4)]

# create data structure for dependencies
dictionary = {}
for file in files:
    dictionary[file] = [random.choice(words) for i in range(0,2)]

Below is an example of the output of the dictionary object, in which each key is a file and its value is a list of tags for that file. Remember that we are using random to choose our words, so your output will most likely look different than mine. Conceptually, we can think of the items in each list as tags for that file, or its dependencies/children, where adaptor.csv has child tags expanding and beat, and so on:

>>> dictionary
{'adaptor.csv': ['expanding', 'beat'], 'playing.csv': ['cup', 'personality'], 'unable.csv': ['garlic', 'periodic'], 'mortgages.csv': ['hazards', 'requests']}

From there, let's create our graph using Networkx, while also adding an additional layer of dependencies for complexity. Remember that edges (links) represent a relationship between nodes (objects):

# create our edge nodes according to our data structure
G = nx.DiGraph()
for files,tags in dictionary.items(): # iterating through dictionary items
    for tag in tags:
        G.add_edge(tag,files) # creating relationship between tag and files in our graph
        G.add_edge(random.choice(words),tag) # adding one more layer of dependencies

Once we have the basic graph structure fleshed out, we can now visualize our graph with the help of networkx and matplotlib:

# prettying it up
pos = nx.spring_layout(G,k=0.4)
args = dict(node_size=400,alpha=0.4,font_size=8,with_labels=True,node_color='b')
nx.draw(G, pos, **args)
plt.savefig('G.png',format='PNG') # saving figure to use picture later
plt.show()
plt.clf() # this closes the graph

When I ran the code above, this is the image that was generated:

graph

The arrows represent edges and the circles are nodes. We can now see all of our parent files and child tags visually represented. However, the graph is a little hard to read because of the layout. networkx provides different layouts that you can use depending on how you want to structure your visual graph:

# shell graph
pos = nx.shell_layout(G)
args = dict(node_size=400,alpha=0.4,font_size=8,with_labels=True,node_color='b')
nx.draw(G, pos, **args)
plt.savefig('G_shell.png',format='PNG') # saving figure to use picture later
plt.show()
plt.clf() # this closes the graph

The code above produced the following image:

shell graph

Play around with the different layouts and visual parameters in the networkx documentation to best help you in getting whatever point across to your colleagues or stakeholders.

DAGs

Now, let's say you have a pipeline and want to create a visual DAG. We can easily create a DAG using Networkx. First, let's create our DiGraph object and create our nodes and edges simulating what a pipeline might look like:

G = nx.DiGraph()
G.add_node('ingest_from_s3.py')
G.add_edge('ingest_from_s3.py','load_from_s3.py')
G.add_edge('load_from_s3.py','validate_data.py')
G.add_edge('load_from_s3.py','clean_data.py')
G.add_edge('clean_data.py','dump_into_snowflake.py')
G.add_edge('validate_data.py','dump_into_snowflake.py')
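Since these edges are meant to form a DAG, it's worth sanity-checking that no cycles have snuck in before sorting or laying anything out; networkx ships a helper for exactly this. A small sketch building the same pipeline graph:

```python
import networkx as nx

# same pipeline graph as above
G = nx.DiGraph()
G.add_edge('ingest_from_s3.py', 'load_from_s3.py')
G.add_edge('load_from_s3.py', 'validate_data.py')
G.add_edge('load_from_s3.py', 'clean_data.py')
G.add_edge('clean_data.py', 'dump_into_snowflake.py')
G.add_edge('validate_data.py', 'dump_into_snowflake.py')

# True only if the directed graph contains no cycles
is_dag = nx.is_directed_acyclic_graph(G)
print(is_dag)  # True

# adding a back-edge from the sink to the source creates a cycle
G.add_edge('dump_into_snowflake.py', 'ingest_from_s3.py')
print(nx.is_directed_acyclic_graph(G))  # False
```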

Now, let's sort our nodes by using a topological sort to display our nodes in order of appearance and hierarchy:

for layer, nodes in enumerate(nx.topological_generations(G)):
    for node in nodes:
        G.nodes[node]["layer"] = layer

pos = nx.multipartite_layout(G, subset_key="layer")

From here, let's pass in our parameters and create our graph:

args = dict(node_size=400,alpha=0.4,font_size=8,with_labels=True,node_color='b',arrows=True)
plt.figure(figsize=(9,9))
nx.draw(G, pos, **args)
plt.savefig('G_dag.png',format='PNG') # saving figure to use picture later
plt.show()
plt.clf() # this closes the graph

The above code produces the following image:

dag graph

Boom! There you go. Who needs Canva and draw.io when you've got Networkx?