Export from GitHub to S3

CloudQuery is an open-source data integration platform that allows you to export data from any source to any destination.

The CloudQuery GitHub plugin allows you to sync data from GitHub to any destination, including S3. It's free, open source, requires no account, and takes only minutes to get started.

Ready? Let's dive right in!

Step 1. Install the CloudQuery CLI

The CloudQuery CLI is a command-line tool that runs the sync. It supports macOS, Linux, and Windows. The command below installs it with Homebrew:

brew install cloudquery/tap/cloudquery

Step 2. Configure the GitHub source plugin

Create a configuration file for the GitHub plugin and set up authentication.

Configuration

Create a file called github.yaml in your CloudQuery configuration directory and add the following contents. This configuration extracts information from the cloudquery/cloudquery repository:

kind: source
spec:
  # Source spec section
  name: github
  path: cloudquery/github
  version: "v5.2.0"
  tables: ["*"]
  destinations: ["s3"]
  spec:
    access_token: <YOUR_ACCESS_TOKEN_HERE> # Personal Access Token, required if not using App Authentication.
    ## App Authentication (one per org):
    # app_auth:
    # - org: cloudquery
    #   private_key_path: <PATH_TO_PRIVATE_KEY> # Path to private key file
    #   app_id: <YOUR_APP_ID> # App ID, required for App Authentication.
    #   installation_id: <ORG_INSTALLATION_ID> # Installation ID for this org
    orgs: [] # Optional. List of organizations to extract from
    repos: ["cloudquery/cloudquery"] # Optional. List of repositories to extract from
    ## GitHub Enterprise
    # In order to enable GHE you have to provide two urls, the base url of the server and the upload url.
    # Quote from GitHub's client:
    #   If the base URL does not have the suffix "/api/v3/", it will be added automatically. If the upload URL does not have the suffix "/api/uploads", it will be added automatically.
    #   Another important thing is that by default, the GitHub Enterprise URL format should be http(s)://[hostname]/api/v3/ or you will always receive the 406 status code. The upload URL format should be http(s)://[hostname]/api/uploads/
    # If you are not configuring against an enterprise server, please omit the enterprise stanza below
    enterprise:
        base_url: "http(s)://[your-ghe-hostname]/api/v3/"
        upload_url: "http(s)://[your-ghe-hostname]/api/uploads/"

You must specify either orgs or repos in the configuration. If a repository is specified in both orgs and repos, it will be extracted only once, and other repositories from that organization will be ignored.

It is recommended that you use environment variable expansion for the access token in production. For example, if the access token is stored in an environment variable called GITHUB_ACCESS_TOKEN:

spec:
  access_token: ${GITHUB_ACCESS_TOKEN}
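As a minimal sketch, the token can be exported in the shell that runs the sync, so that it never has to be written into github.yaml. The value below is a placeholder, and the guard line is an optional addition rather than something CloudQuery requires:

```shell
# Placeholder value: substitute the token you created for CloudQuery.
export GITHUB_ACCESS_TOKEN="<YOUR_ACCESS_TOKEN_HERE>"

# Stop with a clear message if the variable is unset or empty.
: "${GITHUB_ACCESS_TOKEN:?GITHUB_ACCESS_TOKEN must be set before running the sync}"
```

With the variable set, the access_token: ${GITHUB_ACCESS_TOKEN} line picks up the value when the configuration is loaded.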

Fine-tune this configuration to match your needs. For more information, see the GitHub Plugin page in the docs.

Authentication

The GitHub source plugin supports two authentication methods: Personal Access Token and App authentication. Which one you use is up to you and the security requirements of your organization.

CloudQuery requires only read permissions (we will never make any changes to your GitHub account or organizations), so, following the principle of least privilege, it's recommended to grant it read-only permissions to all the resources you wish to sync.

Personal Access Token

Follow this guide on how to create a personal access token for CloudQuery.

App authentication

For App authentication, create a GitHub App by following this guide, then install it into your organization(s). Grant it only the permissions you need (read-only is recommended).

Every organization will have a unique installation ID. You can find it by going to the organization's settings page, and clicking on the "Installed GitHub Apps" tab. The installation ID is the number in the URL of the page.
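For illustration, here is a sketch of the same source configuration with the app_auth block from the earlier example uncommented in place of access_token. The file name github-app-auth.yaml is an arbitrary choice, the enterprise section is left out, and every value in angle brackets is a placeholder:

```shell
# Write a variant of the source config that uses App authentication.
# The quoted 'EOF' keeps the shell from expanding anything inside the file body.
cat > github-app-auth.yaml <<'EOF'
kind: source
spec:
  name: github
  path: cloudquery/github
  version: "v5.2.0"
  tables: ["*"]
  destinations: ["s3"]
  spec:
    # One app_auth entry per organization
    app_auth:
      - org: cloudquery
        private_key_path: "<PATH_TO_PRIVATE_KEY>"
        app_id: "<YOUR_APP_ID>"
        installation_id: "<ORG_INSTALLATION_ID>"
    repos: ["cloudquery/cloudquery"]
EOF
```

Pass this file to the sync command in place of github.yaml if you choose App authentication.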

Step 3. Configure the S3 destination plugin

Create a configuration file for the S3 plugin and set up authentication.

Configuration

This example uses the Parquet format to write files to s3://bucket_name/path/to/files, with each table placed in its own directory. The (top level) spec section is described in the Destination Spec Reference.

Create a file called s3.yaml and add the following contents:

kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.3.0"
  write_mode: "append" # s3 only supports 'append' mode
  # batch_size: 10000 # optional
  # batch_size_bytes: 5242880 # optional
  spec:
    bucket: "bucket_name"
    region: "region-name" # Example: us-east-1
    path: "path/to/files/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet"
    athena: false # <- set this to true for Athena compatibility

It is also possible to use {{YEAR}}, {{MONTH}}, {{DAY}} and {{HOUR}} in the path to create a directory structure based on the current time. For example:

path: "path/to/files/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet"
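To make the placeholder substitution concrete, the following sketch expands a path template by hand. It is an illustration only: the expand_path function is hypothetical, github_issues stands in for a table name, the UUID is shortened, and zero-padded month and day values are an assumption that the docs do not confirm.

```shell
# Hypothetical helper that mimics placeholder substitution for illustration.
expand_path() {
  # $1 = path template, $2 = table name, $3 = date as YYYY-MM-DD, $4 = UUID
  year=${3%%-*}     # text before the first dash
  rest=${3#*-}      # text after the first dash (MM-DD)
  month=${rest%%-*}
  day=${rest#*-}
  printf '%s\n' "$1" | sed \
    -e "s/{{TABLE}}/$2/g" \
    -e "s/{{YEAR}}/$year/g" \
    -e "s/{{MONTH}}/$month/g" \
    -e "s/{{DAY}}/$day/g" \
    -e "s/{{UUID}}/$4/g"
}

template='path/to/files/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet'
expand_path "$template" github_issues 2024-01-15 abc123
# prints: path/to/files/github_issues/dt=2024-01-15/abc123.parquet
```

A dt=YYYY-MM-DD directory per day keeps each day's files together under the table's directory.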

Other supported formats are json and csv.

Fine-tune this configuration to match your needs. For more information, see the S3 Plugin page in the docs.

Authentication

The plugin needs to be authenticated with your AWS account in order to write to your S3 bucket.

The plugin requires only the s3:PutObject permission (it writes new objects and makes no other changes to your cloud setup), so, following the principle of least privilege, it's recommended to grant it only that permission.
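A policy along these lines would grant just that permission. This is a sketch, assuming the bucket_name bucket and path/to/files prefix from the example configuration; the file name is arbitrary, and your organization may require additional conditions:

```shell
# Write a least-privilege IAM policy document for the S3 destination.
# bucket_name and path/to/files match the example s3.yaml; adjust both to your setup.
cat > cloudquery-s3-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::bucket_name/path/to/files/*"
    }
  ]
}
EOF
```

The file can then be attached to the IAM user or role that runs CloudQuery, for example with aws iam put-role-policy or through the AWS console.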

There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means that CloudQuery checks the following sources, in order, when attempting to authenticate:

1. The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables.
2. The credentials and config files in ~/.aws (the credentials file takes priority). You can also use aws sso to authenticate CloudQuery.
3. IAM roles for AWS compute resources (including EC2 instances, Fargate, and ECS containers).

You can read more about AWS authentication in the AWS documentation.

Environment Variables

CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN is only needed for temporary credentials). For information on obtaining credentials, see the AWS guide.

To export the environment variables (on Linux/macOS; similar for Windows):

export AWS_ACCESS_KEY_ID="<YOUR_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<YOUR_SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<YOUR_SESSION_TOKEN>"

Shared Configuration files

The plugin can use credentials from your credentials and config files in the .aws directory in your home folder. The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file.

For information about obtaining credentials, see the AWS guide.

Here are example contents for a credentials file:

~/.aws/credentials

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

You can also specify credentials for a different profile, and instruct CloudQuery to use the credentials from this profile instead of the default one.

For example:

~/.aws/credentials

[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Then export the AWS_PROFILE environment variable (on Linux/macOS; similar for Windows):

export AWS_PROFILE=myprofile

IAM Roles for AWS Compute Resources

The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate, and ECS containers). If you configured your AWS compute resources with IAM roles, the plugin will use these roles automatically. For more information on configuring IAM, see the AWS docs.

User Credentials with MFA

To use IAM user credentials with MFA, run the STS get-session-token command with the IAM user's long-term security credentials (access key and secret access key). For more information, see the AWS documentation.

aws sts get-session-token --serial-number <YOUR_MFA_SERIAL_NUMBER> --token-code <YOUR_MFA_TOKEN_CODE> --duration-seconds 3600

Then export the temporary credentials returned by the command as environment variables:

export AWS_ACCESS_KEY_ID="<YOUR_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<YOUR_SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<YOUR_SESSION_TOKEN>"
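The two steps above can be combined. The sketch below assumes the AWS CLI is installed and uses its --query and --output text options to pull the three values out of the response; the function name export_mfa_credentials is hypothetical:

```shell
# Hypothetical helper: request temporary credentials and export them in one step.
export_mfa_credentials() {
  # $1 = MFA device serial number, $2 = current MFA token code
  creds=$(aws sts get-session-token \
    --serial-number "$1" \
    --token-code "$2" \
    --duration-seconds 3600 \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text) || return 1

  # With --output text the three values arrive on one line, separated by tabs.
  # Unquoted $creds is split on whitespace into $1, $2, and $3.
  set -- $creds
  export AWS_ACCESS_KEY_ID="$1"
  export AWS_SECRET_ACCESS_KEY="$2"
  export AWS_SESSION_TOKEN="$3"
}
```

Define the function in the same shell session that will run the sync, then call it with your MFA serial number and the current token code. The exported variables last only for that session, and the credentials expire after the requested 3600 seconds.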

Using a Custom S3 Endpoint

If you are using a custom S3 endpoint, you can specify it using the endpoint spec option. If you're using authentication, the region option in the spec determines the signing region used.
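As a sketch, a destination file for an S3-compatible service might look like the following. The endpoint URL and the file name s3-custom-endpoint.yaml are hypothetical; the remaining values repeat the earlier example:

```shell
# Write a destination config that points at a custom S3 endpoint.
# The quoted 'EOF' keeps {{TABLE}} and {{UUID}} literal in the file.
cat > s3-custom-endpoint.yaml <<'EOF'
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.3.0"
  write_mode: "append"
  spec:
    bucket: "bucket_name"
    region: "region-name" # also determines the signing region
    path: "path/to/files/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet"
    endpoint: "https://s3.example.internal" # hypothetical endpoint URL
EOF
```

Some S3-compatible services need further options beyond endpoint; check the S3 plugin docs for the version you are running.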

Step 4. Start the Sync

Run the following command in your terminal to start the sync:

cloudquery sync github.yaml s3.yaml

And away we go! 🚀 The sync will run until completion, fetching all selected tables from GitHub. Any errors will be logged to a file called cloudquery.log.
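Once the sync finishes, a quick check along these lines can confirm that files arrived. It is a sketch that assumes the AWS CLI is installed and that your credentials also allow listing the bucket (s3:ListBucket), which goes beyond the PutObject permission the plugin itself needs; the verify_sync function name is hypothetical:

```shell
# Hypothetical helper: list the uploaded objects and show the end of the log.
verify_sync() {
  # $1 = S3 URI prefix that the destination writes to
  echo "Objects under $1:"
  aws s3 ls "$1" --recursive
  echo "Last lines of cloudquery.log:"
  tail -n 20 cloudquery.log
}
```

For the example configuration, the call would be verify_sync "s3://bucket_name/path/to/files/", run from the directory that contains cloudquery.log.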
