Invalidating Copyright Infringement Claims with Python and Fuzzy Hashing
Joe T. Sylve, M.S.

Managing Partner
504ENSICS Labs
Background
• Client was being sued for copyright infringement
• Client's lawyer wanted two questions answered
  • Does the code contain any open source or GPL code?
  • When was the code in question written?
• Code was written in PHP (web-based application)
• Code had absolutely no comments
  • No copyright headers
  • No dates of any kind

www.504ensics.com
Goal
• If it can be proven that the code contains open source or GPL code with restrictive licenses, then the claim is invalid
• If it can be proven that the copyrighted code on file was written after the author's claimed "creation date", the copyright is invalid

Is code original?
• No comments or headers that would imply authorship
• Code didn't look familiar
• Code was kind of crappy

Step 1 – Acquire Samples
• Wrote a Python script to download all projects written in PHP from GitHub
  • Scraped from the search feature
  • Limited to 50 pages of search results
• Got something like 10 GB of compressed code
  • ~100,000 files
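
The acquisition script itself is not shown in the slides. As a rough sketch of the idea, the block below builds paged search queries for PHP repositories and turns each hit into a zip download URL. It assumes GitHub's REST search endpoint rather than the scraped search page the talk describes, and the helper names (`search_url`, `archive_url`, `iter_archive_urls`) are hypothetical. Only the 50-page cap comes from the slide.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the REST search endpoint stands in for the scraped search page.
SEARCH_API = "https://api.github.com/search/repositories"
MAX_PAGES = 50  # page cap mentioned on the slide


def search_url(page, per_page=20):
    """Build the search URL for one page of PHP repositories."""
    query = urllib.parse.urlencode(
        {"q": "language:php", "page": page, "per_page": per_page}
    )
    return f"{SEARCH_API}?{query}"


def archive_url(full_name, branch):
    """Build the zip download URL for a repository such as 'owner/repo'."""
    return f"https://github.com/{full_name}/archive/refs/heads/{branch}.zip"


def iter_archive_urls(max_pages=MAX_PAGES):
    """Yield one archive URL per repository found (performs network requests)."""
    for page in range(1, max_pages + 1):
        with urllib.request.urlopen(search_url(page)) as response:
            items = json.load(response)["items"]
        if not items:
            break
        for repo in items:
            yield archive_url(repo["full_name"], repo["default_branch"])


print(search_url(1))
print(archive_url("owner/repo", "main"))
```

Nothing is downloaded until `iter_archive_urls()` is iterated. A real run would also need rate limiting and error handling, which are left out here.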

Step 2 – Compare Code
• Three options
  • Manual verification
    • Grad students, interns, etc.
  • Cryptographic hashing
    • MD5, SHA-1, etc.
  • "Fuzzy" hashing
    • ssdeep, sdhash
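
To make the cryptographic hashing option concrete, here is a minimal sketch of an exact-match index built with `hashlib`. The two-file corpus and the function names are made up for illustration. It finds byte-identical copies quickly, but a copy with one space removed slips through, which is the weakness of this option.

```python
import hashlib


def sha1_of(data: bytes) -> str:
    """Return the SHA-1 digest of a file's contents as hex."""
    return hashlib.sha1(data).hexdigest()


def build_index(corpus):
    """Map digest -> list of file names for a corpus given as {name: bytes}."""
    index = {}
    for name, data in corpus.items():
        index.setdefault(sha1_of(data), []).append(name)
    return index


def exact_matches(suspect: bytes, index):
    """Return the corpus files that are byte-for-byte identical to suspect."""
    return index.get(sha1_of(suspect), [])


# Hypothetical two-file corpus standing in for the downloaded projects.
corpus = {
    "a/util.php": b"<?php function plus_one($x) { return $x + 1; }\n",
    "b/util.php": b"<?php function plus_two($x) { return $x + 2; }\n",
}
index = build_index(corpus)

identical = b"<?php function plus_one($x) { return $x + 1; }\n"
reformatted = b"<?php function plus_one($x) { return $x+1; }\n"

print(exact_matches(identical, index))    # ['a/util.php']
print(exact_matches(reformatted, index))  # []
```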

Fuzzy Hashing
• Vassil says I have to call it "Approximate Matching"
• sdhash
  • Vassil Roussev & Candice Quates
  • Free, open source
  • Awesome

• Traditional hashing
  • If a single bit of the input changes, the whole hash changes
• Fuzzy hashing
  • Compares files and gives a similarity index
  • Can find "similar" files
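
The contrast can be illustrated with the standard library alone. In the sketch below, `difflib.SequenceMatcher` is only a stand-in for a similarity index. It is not the ssdeep or sdhash algorithm, and it compares the texts directly instead of comparing compact digests. The PHP snippet and the `similarity` helper are hypothetical.

```python
import difflib
import hashlib

original = "<?php\nfunction plus_one($x) {\n    return $x + 1;\n}\n"
# The same function with its parameter renamed.
modified = original.replace("$x", "$value")


def similarity(a: str, b: str) -> int:
    """Stand-in similarity index from 0 to 100 (not a real fuzzy hash)."""
    return round(100 * difflib.SequenceMatcher(None, a, b).ratio())


md5_original = hashlib.md5(original.encode()).hexdigest()
md5_modified = hashlib.md5(modified.encode()).hexdigest()

# A small edit yields a completely different cryptographic hash.
print(md5_original)
print(md5_modified)
print(md5_original == md5_modified)    # False

# The similarity index still reports the two files as close.
print(similarity(original, modified))
print(similarity(original, original))  # 100
```

A real fuzzy hashing tool gets a comparable score from short digests, which is what makes it practical to compare against a corpus of roughly 100,000 files.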
When was code written?
• We can invalidate the copyright if the sample on file was written after the claimed authorship date
• No comments or dates of any kind in the code!
• No access to the developer's workstation to do traditional forensics
• ???

PHP
• Web-based language
• Updated reasonably frequently
• New features added often
• Goal
  • Determine which features were used in the code
  • Correlate features with PHP release dates
  • Code couldn't have been written before the release date of the newest feature it uses

Step 1 – Function Use
• Programmer can create their own functions or use ones available in the language
• Ex.
  • function plus_one($x) { return $x + 1; }
• Python script to find all function declarations and calls
  • Ignore declared functions
  • Left with a list of language "features" used
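
The original script is not part of the slides. A minimal regex-based sketch of the same idea follows. The patterns, the `CONSTRUCTS` list, and the `builtin_calls` helper are assumptions made for illustration. A real tool would need a proper tokenizer to handle string literals, `new ClassName(...)`, and other edge cases.

```python
import re

# "function name(" declares a user-defined function.
DECLARATION = re.compile(r"\bfunction\s+&?\s*([A-Za-z_]\w*)\s*\(")
# "name(" is a call, unless it is a method ("->name("), a static call
# ("::name("), or a variable function ("$name(").
CALL = re.compile(r"(?<![\w$>:])([A-Za-z_]\w*)\s*\(")
# Language constructs that look like calls but are not functions
# (an incomplete, illustrative list).
CONSTRUCTS = {
    "if", "elseif", "while", "for", "foreach", "switch", "catch",
    "function", "array", "list", "isset", "unset", "empty",
    "echo", "print", "exit", "die", "return", "include", "require",
}


def builtin_calls(source: str) -> set:
    """Return called names that are neither declared locally nor constructs."""
    # PHP function names are case-insensitive, so compare in lower case.
    declared = {name.lower() for name in DECLARATION.findall(source)}
    called = {name.lower() for name in CALL.findall(source)}
    return called - declared - CONSTRUCTS


sample = """<?php
function plus_one($x) { return $x + 1; }
if (isset($_GET['n'])) {
    echo json_encode(array('n' => plus_one(intval($_GET['n']))));
}
"""
print(sorted(builtin_calls(sample)))  # ['intval', 'json_encode']
```

Here `plus_one` is dropped because it is declared in the file, leaving only the built-in functions as candidate language "features".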

Step 2 – Version Detection
• PHP comes with auto-generated documentation about each built-in function
• Documentation says in which version each function first became available
• Write a Python script to scrape the PHP documentation
  • Correlate functions with PHP versions
  • We only care about the function with the newest version
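
Once the availability text has been scraped, the correlation step can be sketched as below. The scraping itself is omitted. The three availability strings are illustrative samples written in the style of the PHP manual, and `first_version` and `newest_requirement` are hypothetical helper names.

```python
import re

# Matches entries such as "PHP 4" or "PHP 5 >= 5.2.0"; PECL entries are skipped.
ENTRY = re.compile(r"\bPHP (\d+)(?: >= (\d+(?:\.\d+)*))?")


def first_version(availability: str) -> tuple:
    """Return the earliest PHP version in an availability string as a tuple."""
    versions = []
    for major, minimum in ENTRY.findall(availability):
        parts = [int(p) for p in (minimum or major).split(".")]
        parts += [0] * (3 - len(parts))  # pad "5" to (5, 0, 0)
        versions.append(tuple(parts))
    return min(versions)


def newest_requirement(availability_by_function: dict) -> tuple:
    """Return (function, version) for the function that became available last."""
    name = max(
        availability_by_function,
        key=lambda n: first_version(availability_by_function[n]),
    )
    return name, first_version(availability_by_function[name])


# Illustrative availability strings, not output from the original script.
used = {
    "strlen": "(PHP 4, PHP 5, PHP 7, PHP 8)",
    "file_put_contents": "(PHP 5, PHP 7, PHP 8)",
    "json_encode": "(PHP 5 >= 5.2.0, PHP 7, PHP 8, PECL json >= 1.2.0)",
}
print(newest_requirement(used))  # ('json_encode', (5, 2, 0))
```

Versions are compared as integer tuples rather than strings so that, for example, 5.10.0 sorts after 5.2.0.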

Step 3 – Date the code
• PHP has an archive of release notes on its website
  • Contains release versions and dates
• Python script scrapes the release notes for the PHP version of interest and gives us the release date
• Reasonably, the code couldn't have been written before that date
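
The final comparison is simple date arithmetic. In the sketch below, the release table holds only the PHP 5.1.5 entry reported in this talk. The claimed creation date is a made-up placeholder, since the real one is not given in the slides, and the function names are hypothetical.

```python
from datetime import date, datetime

# Only the PHP 5.1.5 entry comes from the talk; a real table would be
# filled in from the scraped release notes.
RELEASES = {"5.1.5": "17-Aug-2006"}


def release_date(version: str) -> date:
    """Look up and parse the release date of a PHP version."""
    return datetime.strptime(RELEASES[version], "%d-%b-%Y").date()


def predates_language_feature(claimed_creation: date, newest_version: str) -> bool:
    """True when the claimed date is earlier than the release the code requires."""
    return claimed_creation < release_date(newest_version)


claimed = date(2005, 1, 1)  # hypothetical claimed creation date

print(release_date("5.1.5"))                        # 2006-08-17
print(predates_language_feature(claimed, "5.1.5"))  # True
```

A `True` result means the code uses a feature that did not exist on the claimed creation date.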

Step 4 – Profit
• Win!
• Code in question used features first available in PHP 5.1.5
  • Release date: 17-Aug-2006
  • This was after the claimed creation date

Conclusion
• Sometimes you can't depend solely on existing tools
• Learn to program even if you're not a "programmer"
• PHP sucks
• Fuzzy hashing and Python are cool
