Data Engineer

Creative Commons is building a “front door” to the growing universe of openly licensed and public domain content through CC Search and the CC Catalog API. The Data Engineer reports to the Director of Engineering and is responsible for CC Catalog, the open source catalog that powers those products. This project will unite billions of records for openly licensed and public domain works and their metadata, across multiple platforms, diverse media types, and a variety of user communities and partners.

Primary responsibilities

Architect, build, and maintain the existing CC Catalog, including:

  • Ingesting content from new and existing sources of CC-licensed and public domain works.
  • Scaling the catalog to support billions of records and various media types.
  • Implementing resilient, distributed data solutions that operate robustly at web scale.
  • Automating data pipelines and workflows.
  • Collaborating with the Backend Software Engineer and Front End Engineer to support the smooth operation of the CC Catalog API and CC Search.
  • Augmenting and improving the metadata associated with content indexed into the catalog using one or more of the following: machine learning, computer vision, OCR, data analysis, web crawling/scraping.

Build an open source community around the CC Catalog, including:

  • Restructuring the code and workflows so that community contributors can identify new sources of content and add new data to the catalog.
  • Guiding new contributors and potentially participating in projects such as Google Summer of Code as a mentor.
  • Writing blog posts, maintaining documentation, reviewing pull requests, and responding to issues from the community.
  • Collaborating with outside communities, companies, and institutions to further Creative Commons’ mission.

Restrictions

  • Telecommuting is OK
  • No Agencies Please

Requirements

  • Demonstrated experience building and deploying large-scale data services, including database design and modeling, ETL processing, and performance optimization
  • Proficiency with Python
  • Proficiency with Apache Spark
  • Experience with cloud computing platforms such as AWS
  • Experience with Apache Airflow or other workflow management software
  • Experience with machine learning or an interest in learning it
  • Fluent in English
  • Excellent written and verbal communication skills
  • Ability to work independently, build good working relationships, and actively communicate, contribute, and speak up in a remote work structure
  • Curiosity and a desire to keep learning
  • Commitment to consumer privacy and security

Nice to have (but not required):

  • Experience with contributing to or maintaining open source software
  • Experience with web crawling
  • Experience with Docker

About the Company

Creative Commons is a nonprofit organization that enables the sharing and use of creativity and knowledge through free legal tools. We are a leader in the global movement for free culture and open knowledge with an active global community in over 85 countries. Our free, easy-to-use copyright licenses provide a simple, standardized way to give the public permission to share and use your creative work — from “all rights reserved” to “some rights reserved.” The first phase of CC’s work was about establishing the licenses as standard, and growing the archive. The next phase is building a global movement that will create a more vibrant, usable commons powered by collaboration and gratitude. Today, the global commons stands at over 1.2 billion licensed works, made up of photos, video, audio, datasets, open textbooks, research, 3D models, and more.

Posted: Aug. 23, 2019