Metadata-Version: 2.1
Name: mrjob
Version: 0.7.4
Summary: Python MapReduce framework
Home-page: http://github.com/Yelp/mrjob
Author: David Marin
Author-email: dm@davidmarin.org
License: Apache
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: System :: Distributed Computing
Provides: mrjob
Requires-Dist: PyYAML (>=3.10)
Provides-Extra: aws
Requires-Dist: boto3 (>=1.10.0) ; extra == 'aws'
Requires-Dist: botocore (>=1.13.26) ; extra == 'aws'
Provides-Extra: google
Requires-Dist: google-cloud-dataproc (<=1.1.0,>=0.3.0) ; extra == 'google'
Requires-Dist: google-cloud-logging (>=1.9.0) ; extra == 'google'
Requires-Dist: google-cloud-storage (>=1.13.1) ; extra == 'google'
Provides-Extra: rapidjson
Requires-Dist: python-rapidjson ; extra == 'rapidjson'
Provides-Extra: simplejson
Requires-Dist: simplejson ; extra == 'simplejson'
Provides-Extra: ujson
Requires-Dist: ujson ; extra == 'ujson'

mrjob: the Python MapReduce library
===================================

.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop
Streaming jobs.

`Stable version (v0.7.4) documentation `_

`Development version documentation `_

.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows
you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic
support for Google Cloud Dataproc (Dataproc), which allows you to buy time on
a Hadoop cluster on a minute-by-minute basis. It also works with your own
Hadoop cluster.

Some important features:

* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally
  (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop:

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run ``make`` and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by the ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to the Hadoop job tracker (EMR only)
* Minimal setup:

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``
  * No setup needed to use mrjob on your own Hadoop cluster

Installation
------------

``pip install mrjob``

As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``

A Simple Map Reduce Job
-----------------------

Code for this example and more live in ``mrjob/examples``.

.. code-block:: python

   """The classic MapReduce job: count the frequency of words.
   """
   from mrjob.job import MRJob
   import re

   WORD_RE = re.compile(r"[\w']+")


   class MRWordFreqCount(MRJob):

       def mapper(self, _, line):
           for word in WORD_RE.findall(line):
               yield (word.lower(), 1)

       def combiner(self, word, counts):
           yield (word, sum(counts))

       def reducer(self, word, counts):
           yield (word, sum(counts))


   if __name__ == '__main__':
       MRWordFreqCount.run()

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account `_
* Get your access and secret keys (click "Security Credentials" on
  `your account page `_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly

Setting up Dataproc on Google
-----------------------------

* `Create a Google Cloud Platform account `_ (see top right)
* `Learn about Google Cloud Platform "projects" `_
* `Select or create a Cloud Platform Console project `_
* `Enable billing for your project `_
* Go to the `API Manager `_ and search for and enable the following APIs:

  * Google Cloud Storage
  * Google Cloud Storage JSON API
  * Google Cloud Dataproc API

* Under Credentials, **Create Credentials** and select **Service account
  key**. Then select **New service account**, enter a Name, and select
  **Key type** JSON.
* Install the `Google Cloud SDK `_

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation `_ for more information.
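
For a concrete sense of what such a file looks like, here is a minimal
sketch of a ``~/.mrjob.conf``. It is a YAML file keyed by runner; the option
names used here (``region``, ``instance_type``, ``cmdenv``, ``setup``) are
real mrjob options, but the values are placeholders for illustration, not
recommended defaults:

.. code-block:: yaml

    # illustrative sketch only -- values are placeholders, not defaults
    runners:
      emr:
        region: us-west-2          # run in a non-default AWS region
        instance_type: m5.xlarge
        cmdenv:                    # environment variables for your tasks
          TZ: America/Los_Angeles
        setup:                     # shell commands run before each task
        - make -C mycode/          # hypothetical setup step
      hadoop:
        cmdenv:
          TZ: America/Los_Angeles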
""" from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def combiner(self, word, counts): yield (word, sum(counts)) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() Try It Out! ----------- :: # locally python mrjob/examples/mr_word_freq_count.py README.rst > counts # on EMR python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts # on Dataproc python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts # on your Hadoop cluster python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts Setting up EMR on Amazon ------------------------ * create an `Amazon Web Services account `_ * Get your access and secret keys (click "Security Credentials" on `your account page `_) * Set the environment variables ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY`` accordingly Setting up Dataproc on Google ----------------------------- * `Create a Google Cloud Platform account `_, see top-right * `Learn about Google Cloud Platform "projects" `_ * `Select or create a Cloud Platform Console project `_ * `Enable billing for your project `_ * Go to the `API Manager `_ and search for / enable the following APIs... * Google Cloud Storage * Google Cloud Storage JSON API * Google Cloud Dataproc API * Under Credentials, **Create Credentials** and select **Service account key**. Then, select **New service account**, enter a Name and select **Key type** JSON. * Install the `Google Cloud SDK `_ Advanced Configuration ---------------------- To run in other AWS regions, upload your source tree, run ``make``, and use other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks for its conf file in: * The contents of ``$MRJOB_CONF`` * ``~/.mrjob.conf`` * ``/etc/mrjob.conf`` See `the mrjob.conf documentation `_ for more information. Project Links ------------- * `Source code `__ * `Documentation `_ * `Discussion group `_ Reference --------- * `Hadoop Streaming `_ * `Elastic MapReduce `_ * `Google Cloud Dataproc `_ More Information ---------------- * `PyCon 2011 mrjob overview `_ * `Introduction to Recommendations and MapReduce with mrjob `_ (`source code `__) * `Social Graph Analysis Using Elastic MapReduce and PyPy `_ Thanks to `Greg Killion `_ (`ROMEO ECHO_DELTA `_) for the logo.