際際滷

際際滷Share a Scribd company logo
Servo & SQLGitHub
A JOURNEY OF TRANSFORMING FAILURES INTO A GOOD IDEA
Servo
 A new browser engine written from scratch in Rust by Mozilla
 Original Project: Write an Android frontend for servo
 Issues, Issues and issues
 Zero knowledge about mobile development and Rust
 Outdated documentation
 Very poorly supported on Android
 Tried approaching in various ways, but to no avail
 Made build and packaging guide and reported issues
SQLGitHub  Motivation (I)
 Who is stealing all my easy bugs/issues?
 There are very limited tools for managing GitHub organizations
 Open-source organizations are usually understaffed
 The Servo project on GitHub for example,
contains 129 repositories
managed by practically 1 person.
SQLGitHub  Background
 Organization: Organizations are shared accounts where businesses
and open-source projects can collaborate across many projects at
once.
 Repository: A repository contains all of the project files, and stores
each file's revision history.
 Commit: An individual change to a set of files.
 Issue: Suggested improvements, tasks or questions
 Pull Request: Proposed change
to a repository.
Repository
Organization
Repository Repository
 Commits
 Issues
 Pull Requests
 
 Commits
 Issues
 Pull Requests
 
 Commits
 Issues
 Pull Requests
 https://help.github.com/articles/github-glossary/
SQLGitHub  Motivation (II)
 Common questions/problems an organization admin include:
 Obtain certain metrics of the organization in machine-friendly format for
post-processing (eg. KPI report)
 Get the current list of projects hosted on GitHub
 List of the most popular repositories in the organization
 Get the list of issues closed (resolved) for the past 7 days
 What are the critical issues that are still left open?
 Who are the top contributors of the past month?
  endless possible questions
Abstract
 SQLGitHub features a SQL-like syntax that allows you to:
Query information about an organization as a whole.
 You may also think of it as a better, enhanced frontend layer built
on top of GitHubs RESTful API
Introduction  Supported Schema
 SELECT
select_expr [, select_expr ...]
FROM {org_name | org_name.{repos | issues | pulls | commits}}
[WHERE where_condition]
[GROUP BY {col_name | expr}
[ASC | DESC], ...]
[HAVING where_condition]
[ORDER BY {col_name | expr}
[ASC | DESC], ...]
[LIMIT row_count]
Introduction  Use Case (I)
 Get name and description from all the repos in apple.
 select name, description from apple.repos
Introduction  Use Case (II)
 Get last-updated time and title of the issues closed in the past week (7
days) in servo listed in descending order of last-updated time.
 select updated_at, title from servo.issues.closed.7 order by updated_at desc
Introduction  Use Case (III)
 Get top 10 most-starred repositories in servo.
 select concat(concat("(", stargazers_count, ") ", name), ": ", description) from
servo.repos order by stargazers_count desc, name limit 10
Introduction  Use Case (IV)
 Get top 10 contributors in servo for the past month (30 days) based
on number of commits.
 select login, count(login) from servo.commits.30 group by login order by
count(login) desc, login limit 10
Introduction  Technology Stack
 Python
 re & regex, regular expression libraries
 PyGithub (patched), an unofficial client library for GitHub API
 prompt_toolkit, a library for building prompts
 pygments, a library for syntax highlighting
Introduction  (Simplified) Flow
Fetch data (from)
Filter by where conditions
Evaluate partial exprs
Group by group exprs
Order by order exprs
Evaluate select exprs
Fetch data with required fields from GitHub API
Evaluate where conditions and filter fetched data
Evaluate group exprs and other field exprs
Generate table groups by values of group exprs
Sort within and between tables
Evaluate select exprs
Filter by having conditions Evaluate having conditions and filter tables
Introduction  Architecture
top_level
parser
tokenizersession
grouping ordering
table_fetcher expression
definitionutilitiestable PyGithub
process user input
process external data
fetches external data
base classes, functions
and SQL definitions
Introduction  Challenges (I)
 Algorithm of parsing is almost identical to that of expression
evaluation  waste of time
 Lazy Parsing: Only parse clauses (eg. select, from, where) and comma-
separated fields
 Comma-separated fields, strings and escape characters
Evaluate this: concat("[)"Stars"(: ", stargazers_count)
 concat("[)"Stars"(: ", stargazers_count)
concat("[)"Stars"(: ", stargazers_count)
 concat("[)"Stars"(: ", stargazers_count)
Introduction  Challenges (II)
 Extracting all relevant fields from expressions to fetch at once
 select concat("[)"-> avg(stargazers_count)"(: ", stargazers_count -
avg(stargazers_count), "] ", name) from apple.repos where description
like "%library%" order by id
 Algorithm: for each expression,
 Remove all literal strings. Use r""(?:[^"]|.)*"" to match.
 Find all possible tokens with r"([a-zA-Z_]+)(?:[^(a-zA-Z_]|$)".
 For each token, check if its a predefined token (ie. part of SQL).
Introduction  Challenges (III)
 Expression Evaluation is really complicated
 Regular (eg. concat, floor) and Aggregate functions (eg. max, min)
 Have to evaluate an entire table at once
 Nested functions (eg. sum(avg(field_a) + avg(field_b)))
 Use recursive regex patterns to extract tokens  r((?:(?>[^()]+|(?R))*))
 Assign special precedence and insert extra logic in place
 Operator Precedence
 Modified 2-stack evaluation approach +
 Finite State Machine + One-token Lookahead
Introduction  Challenges (IV)
 Pythons built-in sort is not customizable:
sorted(iterable, *, key=None, reverse=False)
 order by requires sorting with multiple keys each with potentially
different reverse:
order by field_a desc, field_b asc, field_c, desc
 Wrote custom sort that integrates better with the workflow
Future Directions
 Improve SQL, MySQL compatibility
 Extend to end users not just organizations
 Migrate to the new GraphQL backend (GitHub API v4)
 Integrate SQLGitHub directly on the server end (better efficiency
and perhaps better security!)
Acknowledgements
 We would like to thank:
 Shing Lyu, former software engineer at Mozilla Taiwan for the mentorship
 Irvin Chen, Liaison of MozTW (Mozilla Taiwan Community) for
coordinating the program
 Prof. Cheng-Chung Lin for organizing the program

More Related Content

SQLGitHub - Access GitHub API with SQL-like syntaxes

  • 1. Servo & SQLGitHub A JOURNEY OF TRANSFORMING FAILURES INTO A GOOD IDEA
  • 2. Servo A new browser engine written from scratch in Rust by Mozilla Original Project: Write an Android frontend for servo Issues, Issues and issues Zero knowledge about mobile development and Rust Outdated documentation Very poorly supported on Android Tried approaching in various ways, but to no avail Made build and packaging guide and reported issues
  • 3. SQLGitHub Motivation (I) Who is stealing all my easy bugs/issues? There are very limited tools for managing GitHub organizations Open-source organizations are usually understaffed The Servo project on GitHub for example, contains 129 repositories managed by practically 1 person.
  • 4. SQLGitHub Background Organization: Organizations are shared accounts where businesses and open-source projects can collaborate across many projects at once. Repository: A repository contains all of the project files, and stores each file's revision history. Commit: An individual change to a set of files. Issue: Suggested improvements, tasks or questions Pull Request: Proposed change to a repository. Repository Organization Repository Repository Commits Issues Pull Requests Commits Issues Pull Requests Commits Issues Pull Requests https://help.github.com/articles/github-glossary/
  • 5. SQLGitHub Motivation (II) Common questions/problems an organization admin include: Obtain certain metrics of the organization in machine-friendly format for post-processing (eg. KPI report) Get the current list of projects hosted on GitHub List of the most popular repositories in the organization Get the list of issues closed (resolved) for the past 7 days What are the critical issues that are still left open? Who are the top contributors of the past month? endless possible questions
  • 6. Abstract SQLGitHub features a SQL-like syntax that allows you to: Query information about an organization as a whole. You may also think of it as a better, enhanced frontend layer built on top of GitHubs RESTful API
  • 7. Introduction Supported Schema SELECT select_expr [, select_expr ...] FROM {org_name | org_name.{repos | issues | pulls | commits}} [WHERE where_condition] [GROUP BY {col_name | expr} [ASC | DESC], ...] [HAVING where_condition] [ORDER BY {col_name | expr} [ASC | DESC], ...] [LIMIT row_count]
  • 8. Introduction Use Case (I) Get name and description from all the repos in apple. select name, description from apple.repos
  • 9. Introduction Use Case (II) Get last-updated time and title of the issues closed in the past week (7 days) in servo listed in descending order of last-updated time. select updated_at, title from servo.issues.closed.7 order by updated_at desc
  • 10. Introduction Use Case (III) Get top 10 most-starred repositories in servo. select concat(concat("(", stargazers_count, ") ", name), ": ", description) from servo.repos order by stargazers_count desc, name limit 10
  • 11. Introduction Use Case (IV) Get top 10 contributors in servo for the past month (30 days) based on number of commits. select login, count(login) from servo.commits.30 group by login order by count(login) desc, login limit 10
  • 12. Introduction Technology Stack Python re & regex, regular expression libraries PyGithub (patched), an unofficial client library for GitHub API prompt_toolkit, a library for building prompts pygments, a library for syntax highlighting
  • 13. Introduction (Simplified) Flow Fetch data (from) Filter by where conditions Evaluate partial exprs Group by group exprs Order by order exprs Evaluate select exprs Fetch data with required fields from GitHub API Evaluate where conditions and filter fetched data Evaluate group exprs and other field exprs Generate table groups by values of group exprs Sort within and between tables Evaluate select exprs Filter by having conditions Evaluate having conditions and filter tables
  • 14. Introduction Architecture top_level parser tokenizersession grouping ordering table_fetcher expression definitionutilitiestable PyGithub process user input process external data fetches external data base classes, functions and SQL definitions
  • 15. Introduction Challenges (I) Algorithm of parsing is almost identical to that of expression evaluation waste of time Lazy Parsing: Only parse clauses (eg. select, from, where) and comma- separated fields Comma-separated fields, strings and escape characters Evaluate this: concat("[)"Stars"(: ", stargazers_count) concat("[)"Stars"(: ", stargazers_count) concat("[)"Stars"(: ", stargazers_count) concat("[)"Stars"(: ", stargazers_count)
  • 16. Introduction Challenges (II) Extracting all relevant fields from expressions to fetch at once select concat("[)"-> avg(stargazers_count)"(: ", stargazers_count - avg(stargazers_count), "] ", name) from apple.repos where description like "%library%" order by id Algorithm: for each expression, Remove all literal strings. Use r""(?:[^"]|.)*"" to match. Find all possible tokens with r"([a-zA-Z_]+)(?:[^(a-zA-Z_]|$)". For each token, check if its a predefined token (ie. part of SQL).
  • 17. Introduction Challenges (III) Expression Evaluation is really complicated Regular (eg. concat, floor) and Aggregate functions (eg. max, min) Have to evaluate an entire table at once Nested functions (eg. sum(avg(field_a) + avg(field_b))) Use recursive regex patterns to extract tokens r((?:(?>[^()]+|(?R))*)) Assign special precedence and insert extra logic in place Operator Precedence Modified 2-stack evaluation approach + Finite State Machine + One-token Lookahead
  • 18. Introduction Challenges (IV) Pythons built-in sort is not customizable: sorted(iterable, *, key=None, reverse=False) order by requires sorting with multiple keys each with potentially different reverse: order by field_a desc, field_b asc, field_c, desc Wrote custom sort that integrates better with the workflow
  • 19. Future Directions Improve SQL, MySQL compatibility Extend to end users not just organizations Migrate to the new GraphQL backend (GitHub API v4) Integrate SQLGitHub directly on the server end (better efficiency and perhaps better security!)
  • 20. Acknowledgements We would like to thank: Shing Lyu, former software engineer at Mozilla Taiwan for the mentorship Irvin Chen, Liaison of MozTW (Mozilla Taiwan Community) for coordinating the program Prof. Cheng-Chung Lin for organizing the program