We take a peak into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.
The full show notes for this episode are available at https://www.codingblocks.net/episode198.
- Want to help us out? Leave us a review!
- The 2023 Game Ja-Ja-Ja Jam is coming up!
Twitter has a Data Problem
Moving an Exabyte of Data
- In 2019, over 100 million people per day would visit Twitter.
- Every tweet and user action creates an event that is used by machine learning and employees for analytics.
- Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
- At the time, various technologies were used for data analysis:
- Scalding which required programmer knowledge, and
- Presto and Vertica which had performance issues at scale.
- Another problem was having data spread across multiple systems without a simple way to access it.
Moving pieces to Google Cloud Platform
- The Google Cloud big data tools at play:
- BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse to power your data-driven innovation.
- DataStudio, unifying data in one place with ability to explore, visualize and tell stories with the data.
History of Data Warehousing at Twitter
- 2011 – Data analysis was done with Vertica and Hadoop and data was ingested using Pig for MapReduce.
- 2012 – Replaced Pig with Scalding using Scala APIs that were geared towards creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
- 2016 – Started using Presto to access Hadoop data using SQL and also used Spark for ad hoc data science and machine learning.
- 2018 …
- Scalding for production pipelines,
- Scalding and Spark for ad hoc data science and machine learning,
- Vertica and Presto for ad hoc, interactive SQL analysis,
- Druid for interactive, exploratory access to time-series metrics, and
- Tableau, Zeppelin, and Pivot for data visualization.
- So why the change? To simplify analytical tools for Twitter employees.
BigQuery for Everyone
- Needed to develop an infrastructure to reliably ingest large amounts of data,
- Support company-wide data management,
- Implement access controls,
- Ensure customer privacy, and
- Build systems for:
- Resource allocation,
- Monitoring, and
- In 2018, they rolled out an alpha release.
- The most frequently used tables were offered with personal data removed.
- Over 250 users, from engineering, finance, and marketing used the alpha.
- Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
- The alpha turned out to be a large success so they moved forward with more using BigQuery.
- The most frequently used tables were offered with personal data removed.
- They have a nice diagram that’s an overview of what their processes looked like at this time, where they essentially pushed data into GCS from on-premise Hadoop data clusters, and then used Airflow to move that into BigQuery, from which Data Studio pulled its data.
Ease of Use
- BigQuery was easy to use because it didn’t require the installation of special tools and instead was easy to navigate via a web UI.
- Users did need to become familiar with some GCP and BigQuery concepts such as projects, datasets, and tables.
- They developed educational material for users which helped get people up and running with BigQuery and Data Studio.
- In regards to loading data, they looked at various pieces …
- Cloud Composer (managed Airflow) couldn’t be used due to Domain Restricted Sharing (data governance).
- Google Data Transfer Service was not flexible enough for data pipelines with dependencies.
- They ended up using Apache Airflow as they could customize it to their needs.
- For data transformation, once data was in BigQuery, they created scheduled jobs to do simple SQL transforms.
- For complex transformations, they planned to use Airflow or Cloud Composer with Cloud Dataflow.
- BigQuery is not for low-latency, high-throughput queries, or for low-latency, time-series analytics.
- It is for SQL queries that process large amounts of data.
- Their requirements for their BigQuery usage was to return results within a minute.
- To achieve these requirements, they allowed their internal customers to reserve minimum slots for their queries, where a slot is a unit of computational capacity to execute a query.
- The engineering team had to analyze 800+ queries, each processing around 1TB of data, to figure out how to allocate the proper slots for production and other environments.
- Twitter focused on discoverability, access control, security, and privacy.
- For data discovery and management, they extended their DAL to work with both their on-premise and GCP data, providing a single API to query all sets of data.
- In regards to controlling access to the data, they took advantage of two GCP features:
- Domain restricted sharing, meaning only users inside Twitter could access the data, and
- VPC service controls to prevent data exfiltration as well as only allow access from known IP ranges.
Authentication, Authorization, and Auditing
- For authentication, they used GCP user accounts for ad hoc queries and service accounts for production queries.
- For authorization, each dataset had an owner service account and a reader group.
- For auditing, they exported BigQuery stackdriver logs with detailed execution information to BigQuery datasets for analysis.
Ensuring Proper Handling of Private Data
- They required registering all BigQuery datasets,
- Annotate private data,
- Use proper retention, and
- Scrub and remove data that was deleted by users.
Privacy Categories for Datasets
- Highly sensitive datasets are available on an as-needed basis with least privilege.
- These have individual reader groups that are actively monitored.
- Medium sensitivity datasets are anonymized data sets with no PII (Personally identifiable information) and provide a good balance between privacy and utility, such as, how many users used a particular feature without knowing who the users were.
- Low sensitivity datasets are datasets where all user level information is removed.
- Public datasets are available to everyone within Twitter.
- Scheduled tasks were used to register datasets with the DAL, as well as a number of additional things.
- Roughly the same for querying Presto vs BigQuery.
- There are additional costs associated with storing data in GCS and BigQuery.
- Utilized flat-rate pricing so they didn’t have to figure out fluctuating costs of running ad hoc queries.
- In some situations where querying 10’s of petabytes, it was more cost-effective to utilize Presto querying data in GCS storage.
Could you build Twitter in a weekend?
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Characters Sets (No Excuses!) (JoelOnSoftware.com)
- Scaling data access by moving an exabyte of data to Google Cloud (blog.twitter.com)
- Democratizing data analysis with Google BigQuery (blog.twitter.com)
- Google BigQuery, Cloud data warehouse to power your data-driven innovation (cloud.google.com)
- Google Data Studio, Your data is beautiful. Use it. (datastudio.withgoogle.com)
- Looker Studio, formerly Data Studio (datastudio.google.com)
- Stack Overflow’s engineering blog (Stack Overflow)
- Apache Airflow (Wikipedia)
- Stack Exchange performance (Stack Exchange)
- Elon Musk and Twitter employees engage in war of words (NewsBytesApp.com)
The more I think about it, the 10x developer trope is so less rockstar, and so much more shitty covers band than any of those people would like to admit.— David Whitney (@david_whitney) November 17, 2022
Tip of the Week
- VS Code has a plugin for Kubernetes and it’s actually very nice! Particularly when you “attach” to the container. It installs a couple bits on the container, and you can treat it like a local computer. The only thing to watch for … it’s very easy to set your local context! (marketplace.visualstudio.com)
kafkactlis a great command line tool for managing Apache Kafka and has a consistent API that is intuitive to use. (deviceinsight.github.io)
- Cruise Control is a tool for Apache Kafka that helps balance resource utilization, detect and alert on problems, and administrate. (GitHub)
- iTerm2 is a terminal emulator for macOS that does amazing things. Why aren’t you already using it? (iterm2.com)
- Message compression in Kafka will help you save a lot of space and network bandwidth, and the compression is per message so it’s easy to enable in existing systems! (cwiki.apache.org)