The document describes using Python and fuzzy hashing to analyze PHP code at the center of a copyright infringement lawsuit. To test whether the code was original, the author scraped roughly 100,000 PHP files from GitHub and used fuzzy hashing to compare them against the code in question. To determine when the code was written, the author identified the built-in PHP functions it used and correlated them with PHP release dates, showing that the code was written after the claimed creation date and undermining the copyright claim.
2. Background
• Client was being sued for Copyright Infringement
• Client's lawyer wanted two questions answered
  • Does the code contain any open source or GPL code?
  • When was the code in question written?
• Code was written in PHP (web-based application)
• Code had absolutely no comments
  • No copyright headers
  • No dates of any kind
www.504ensics.com
3. Goal
• If it can be proven that the code contains open source or GPL code with restrictive licenses, then the claim is invalid
• If it can be proven that the copyrighted code on file was written after the author's claimed "creation date", the copyright is invalid
4. Is code original?
• No comments or headers that would imply authorship
• Code didn't look familiar
• Code was kind of crappy
5. Step 1 – Acquire Samples
• Wrote Python script to download all projects written in PHP from GitHub
  • Scraped from search feature
  • Limited to 50 pages of search
• Got something like 10GB of compressed code
  • ~100,000 files
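
The acquisition step could be sketched roughly as follows. The talk scraped GitHub's web search pages; this sketch swaps in GitHub's REST search endpoint instead, and the function names `search_page_urls` and `clone_urls` are hypothetical. It only builds the page URLs and parses a response body, leaving the actual HTTP requests and downloads out.

```python
import json
from urllib.parse import urlencode

# Assumption: the REST search endpoint stands in for the scraped web search.
API = "https://api.github.com/search/repositories"


def search_page_urls(language="php", pages=50, per_page=20):
    """Build one search URL per result page (50 pages, as in the talk).

    50 pages x 20 results stays within the API's 1,000-result search cap.
    """
    return [
        API + "?" + urlencode(
            {"q": f"language:{language}", "page": page, "per_page": per_page}
        )
        for page in range(1, pages + 1)
    ]


def clone_urls(response_text):
    """Pull the clone URL of every repository out of one page of results."""
    return [item["clone_url"] for item in json.loads(response_text)["items"]]
```

Each URL returned by `clone_urls` could then be handed to `git clone` or an archive download to build the sample corpus.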
7. Fuzzy Hashing
• Vassil says I have to call it "Approximate Matching"
• Ssdeep
  • Vassil Roussev & Candace Quates
  • Free, Open Source
  • Awesome
• Traditional hashing
  • If a single bit of the input changes, the whole hash changes
• Fuzzy Hashing
  • Compares files and gives similarity index
  • Can find "similar" files
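
The contrast between the two kinds of hashing can be illustrated with a small sketch. It does not use the fuzzy hashing tool from the talk: the standard library's `difflib.SequenceMatcher` stands in as a simple similarity measure, and the two PHP snippets are made up for illustration.

```python
import hashlib
from difflib import SequenceMatcher

# Two hypothetical PHP snippets that differ only in a variable name.
original = "<?php function plus_one($x) { return $x + 1; } echo plus_one(41); ?>"
modified = "<?php function plus_one($y) { return $y + 1; } echo plus_one(41); ?>"


def sha256_hex(text):
    """Traditional hash: any change in the input yields an unrelated digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a, b):
    """Stand-in similarity index from 0 to 100 (not a real fuzzy hash)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# The digests no longer match, so exact hashing reports "different files".
print(sha256_hex(original) == sha256_hex(modified))  # False
# The similarity index stays high, flagging the files as near-duplicates.
print(similarity(original, modified))
```

A real approximate matching tool computes a compact digest per file so that a large corpus can be compared without holding every file in memory; the pairwise comparison here only conveys the idea of a similarity score.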
8. When was code written?
• We can invalidate copyright if the sample on file was written after the claimed authorship date
• No comments or dates of any kind in the code!
• No access to developer's workstation to do traditional forensics
• ???
9. PHP
• Web-based language
• Updated reasonably frequently
  • New features added often
• Goal
  • Determine which features were used in the code
  • Correlate features with PHP release date
  • Code couldn't have been written before this date
10. Step 1 – Function Use
• Programmer can create own functions or use ones available in the language
• Ex
  • function plus_one($x) { return $x + 1; }
• Python script to find all function declarations and calls
  • Ignore declared functions
• Left with a list of language "features" used
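
A minimal sketch of such a script, assuming a regex-based approach (the talk does not show its implementation). The names `DECLARATION`, `CALL`, `NOT_FUNCTIONS`, and `builtin_calls` are hypothetical, and the sketch does not strip comments or string literals, nor does it filter out class instantiations such as `new Foo(...)`.

```python
import re

# Matches "function name(" and "function &name(".
DECLARATION = re.compile(r"\bfunction\s+&?\s*([A-Za-z_]\w*)\s*\(")
# Matches "name(" unless it is a variable ($f), method (->f) or static (::f) call.
CALL = re.compile(r"(?<![$>:\w])([A-Za-z_]\w*)\s*\(")

# Language constructs that look like calls but are not library functions.
NOT_FUNCTIONS = {
    "if", "elseif", "while", "for", "foreach", "switch", "catch", "function",
    "array", "list", "isset", "unset", "empty", "echo", "print", "return",
    "exit", "die", "include", "include_once", "require", "require_once",
}


def builtin_calls(php_source):
    """Return called function names that the source does not declare itself."""
    # PHP function names are case-insensitive, so normalize to lowercase.
    declared = {name.lower() for name in DECLARATION.findall(php_source)}
    called = {name.lower() for name in CALL.findall(php_source)}
    return sorted(called - declared - NOT_FUNCTIONS)


sample = """<?php
function plus_one($x) { return $x + 1; }
$s = json_encode(array('a' => plus_one(1)));
if (strlen($s) > 0) { echo htmlspecialchars($s); }
$obj->method($s);
"""
print(builtin_calls(sample))  # ['htmlspecialchars', 'json_encode', 'strlen']
```

Here `plus_one` is dropped because the sample declares it, `array` and `if` are dropped as language constructs, and `method` is skipped because it is called on an object.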
11. Step 2 – Version Detection
• PHP comes with auto-generated documentation about each built-in function
• Documentation says in which version each function first became available
• Write Python script to scrape PHP documentation
• Correlate functions with PHP versions
• We only care about the function with the newest version
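
The correlation step might look like the sketch below. It assumes the scraped documentation yields, for each function, a version line in the style the PHP manual prints under a function name, such as `(PHP 5 >= 5.2.0, PHP 7, PHP 8)`. The two sample lines are illustrative, and `first_available` and `newest_requirement` are hypothetical names.

```python
import re

# One entry per "PHP N" or "PHP N >= N.x.y" item; PECL entries are ignored.
VERSION_ENTRY = re.compile(r"\bPHP (\d+)(?: >= (\d+(?:\.\d+)*))?")


def first_available(version_line):
    """Return the earliest PHP version in a doc line as a (major, minor, patch) tuple."""
    versions = []
    for major, minimum in VERSION_ENTRY.findall(version_line):
        parts = [int(p) for p in (minimum or major).split(".")]
        versions.append(tuple(parts + [0] * (3 - len(parts))))
    return min(versions)


def newest_requirement(version_lines):
    """Return the (function, version) pair that needs the most recent PHP release."""
    name = max(version_lines, key=lambda n: first_available(version_lines[n]))
    return name, first_available(version_lines[name])


# Illustrative sample of scraped version lines, keyed by function name.
scraped = {
    "strlen": "(PHP 4, PHP 5, PHP 7, PHP 8)",
    "json_encode": "(PHP 5 >= 5.2.0, PHP 7, PHP 8, PECL json >= 1.2.0)",
}
print(newest_requirement(scraped))  # ('json_encode', (5, 2, 0))
```

Comparing versions as integer tuples avoids the string-comparison trap where "5.10.0" would sort before "5.2.0".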
12. Step 3 – Date the code
• PHP has an archive of release notes on their website
  • Contains release versions and dates
• Python script scrapes release notes for the PHP version of interest and gives us the release date
• Reasonably, the code couldn't have been written before that date
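
The final comparison can be sketched as below, assuming the release notes have already been scraped into a version-to-date table with dates in the `17-Aug-2006` style. The single table entry is the one reported in this talk; the claimed creation date is hypothetical, since the talk does not state it, and the function names are made up for illustration.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Scraped release table; this entry is the one reported in the talk.
RELEASE_DATES = {"5.1.5": "17-Aug-2006"}


def parse_release_date(text):
    """Parse a 'DD-Mon-YYYY' release date without relying on the locale."""
    day, month, year = text.split("-")
    return date(int(year), MONTHS.index(month) + 1, int(day))


def earliest_possible(version, table=RELEASE_DATES):
    """The code cannot predate the release of the newest feature it uses."""
    return parse_release_date(table[version])


def written_after_claim(version, claimed_creation):
    """True if the required PHP release postdates the claimed creation date."""
    return earliest_possible(version) > claimed_creation


# Hypothetical claimed creation date, for illustration only.
claimed = date(2005, 1, 1)
print(earliest_possible("5.1.5"))             # 2006-08-17
print(written_after_claim("5.1.5", claimed))  # True
```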
13. Step 4 – Profit
• Win!
• Code in question used features first available in PHP 5.1.5
  • Release date 17-Aug-2006
  • This was after the claimed creation date
14. Conclusion
• Sometimes you can't depend solely on existing tools
• Learn to program even if you're not a "programmer"
• PHP sucks
• Fuzzy Hashing and Python are cool