mirror of
https://github.com/emilybache/GildedRose-Refactoring-Kata.git
synced 2026-02-14 22:21:20 +00:00
Metadata-Version: 2.1
Name: mrjob
Version: 0.7.4
Summary: Python MapReduce framework
Home-page: http://github.com/Yelp/mrjob
Author: David Marin
Author-email: dm@davidmarin.org
License: Apache
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: System :: Distributed Computing
Provides: mrjob
Requires-Dist: PyYAML (>=3.10)
Provides-Extra: aws
Requires-Dist: boto3 (>=1.10.0) ; extra == 'aws'
Requires-Dist: botocore (>=1.13.26) ; extra == 'aws'
Provides-Extra: google
Requires-Dist: google-cloud-dataproc (<=1.1.0,>=0.3.0) ; extra == 'google'
Requires-Dist: google-cloud-logging (>=1.9.0) ; extra == 'google'
Requires-Dist: google-cloud-storage (>=1.13.1) ; extra == 'google'
Provides-Extra: rapidjson
Requires-Dist: python-rapidjson ; extra == 'rapidjson'
Provides-Extra: simplejson
Requires-Dist: simplejson ; extra == 'simplejson'
Provides-Extra: ujson
Requires-Dist: ujson ; extra == 'ujson'

mrjob: the Python MapReduce library
===================================

.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop
Streaming jobs.

`Stable version (v0.7.4) documentation <http://mrjob.readthedocs.org/en/stable/>`_

`Development version documentation <http://mrjob.readthedocs.org/en/latest/>`_

.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for
Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop
cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.

Some important features:

* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing)
* Write multi-step jobs (one map-reduce step feeds into the next)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by the ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to the Hadoop job tracker (EMR only)
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``
  * No setup needed to use mrjob on your own Hadoop cluster

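
The "minimal setup" items can be checked before launching a job. The following is a minimal sketch, not part of mrjob: the ``REQUIRED_ENV`` table and the ``missing_env`` helper are hypothetical names, and the check only looks at the environment variables named in the list above.

```python
import os

# Variable names come from the feature list above; the table and the helper
# are hypothetical and not part of mrjob.
REQUIRED_ENV = {
    "emr": ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"),
    "dataproc": ("GOOGLE_APPLICATION_CREDENTIALS",),
    "hadoop": (),  # no setup needed on your own cluster
}


def missing_env(runner, environ=None):
    """Return the required variables that are unset or empty for a runner."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[runner] if not environ.get(name)]


if __name__ == "__main__":
    for runner in REQUIRED_ENV:
        missing = missing_env(runner)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(runner, status)
```

Passing a plain dict as ``environ`` makes the helper easy to exercise without touching the real environment.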

Installation
------------

``pip install mrjob``

As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``

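
Because the cloud and JSON libraries are optional, it can help to see which extras are usable in the current environment. This is a rough sketch using only the standard library; the import names in ``EXTRA_MODULES`` are assumptions inferred from the distribution names in the metadata, and the ``google`` extra is left out because its import names are less certain.

```python
import importlib.util

# Import names assumed for each extra; they are inferred from the
# distribution names in the metadata, not defined by mrjob.
EXTRA_MODULES = {
    "aws": ("boto3", "botocore"),
    "rapidjson": ("rapidjson",),
    "simplejson": ("simplejson",),
    "ujson": ("ujson",),
}


def is_importable(module_name):
    """True if the top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def available_extras(probe=is_importable):
    """Map each extra to whether all of its assumed modules are importable."""
    return {
        extra: all(probe(name) for name in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(extra, "available" if ok else "not installed")
```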

A Simple Map Reduce Job
-----------------------

Code for this example and more lives in ``mrjob/examples``.

.. code-block:: python

    """The classic MapReduce job: count the frequency of words.
    """
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            # emit (word, 1) for every word on the input line
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            # pre-sum counts for each word before they reach the reducer
            yield (word, sum(counts))

        def reducer(self, word, counts):
            # sum all counts for each word
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()

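
To see what the job above computes without Hadoop or mrjob installed, the data flow can be imitated in plain Python. This is an illustrative sketch only, not how mrjob runs jobs: ``run_local`` is a hypothetical helper that maps every line, groups values by key (the "shuffle"), and reduces each group. The combiner is omitted because, for a sum, it is only an optimization and does not change the result.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Same logic as the mapper above: one (word, 1) pair per word.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    # Same logic as the reducer above: total the counts for one word.
    yield word, sum(counts)


def run_local(lines):
    """Map every line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "The lazy dog's the best"]))
```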

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

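
Each command above redirects the job's output into a file named ``counts``. The sketch below reads such a file back, assuming each line holds a JSON-encoded key and a JSON-encoded value separated by a tab (for example ``"the"``, a tab, then ``3``). That line format is an assumption about the job's default output and should be checked against the actual file; ``parse_counts`` and ``top_words`` are hypothetical helpers.

```python
import json
import os


def parse_counts(lines):
    """Yield (word, count) pairs from tab-separated, JSON-encoded lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        yield json.loads(key), json.loads(value)


def top_words(lines, n=10):
    """Return the n most frequent words, highest count first."""
    pairs = sorted(parse_counts(lines), key=lambda kv: (-kv[1], kv[0]))
    return pairs[:n]


if __name__ == "__main__":
    # Only runs if a "counts" file from one of the commands above exists.
    if os.path.exists("counts"):
        with open("counts", encoding="utf-8") as f:
            for word, count in top_words(f, n=5):
                print(count, word)
```

Ties are broken alphabetically so that the ordering is deterministic.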

Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Get your access and secret keys (click "Security Credentials" on
  `your account page <http://aws.amazon.com/account/>`_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly

Setting up Dataproc on Google
-----------------------------

* `Create a Google Cloud Platform account <http://cloud.google.com/>`_ (see the top right of the page)
* `Learn about Google Cloud Platform "projects" <https://cloud.google.com/docs/overview/#projects>`_
* `Select or create a Cloud Platform Console project <https://console.cloud.google.com/project>`_
* `Enable billing for your project <https://console.cloud.google.com/billing>`_
* Go to the `API Manager <https://console.cloud.google.com/apis>`_ and search for and enable the following APIs:

  * Google Cloud Storage
  * Google Cloud Storage JSON API
  * Google Cloud Dataproc API

* Under Credentials, **Create Credentials** and select **Service account key**. Then, select **New service account**, enter a Name and select **Key type** JSON.

* Install the `Google Cloud SDK <https://cloud.google.com/sdk/>`_

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:

* The path given in ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation
<https://mrjob.readthedocs.io/en/latest/guides/configs-basics.html>`_ for more
information.

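
The lookup described above can be pictured as a first-match search over the three locations. This is a sketch under the assumption that the list is a priority order. ``candidate_conf_paths`` and ``find_conf`` are hypothetical helpers written for illustration, not mrjob functions.

```python
import os


def candidate_conf_paths(environ=None):
    """Yield candidate config paths in the order listed above."""
    environ = os.environ if environ is None else environ
    if environ.get("MRJOB_CONF"):
        yield environ["MRJOB_CONF"]
    yield os.path.expanduser("~/.mrjob.conf")
    yield "/etc/mrjob.conf"


def find_conf(environ=None, exists=os.path.exists):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidate_conf_paths(environ):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    print(find_conf() or "no mrjob.conf found")
```

The ``exists`` parameter is injectable so the search order can be tested without creating files.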

Project Links
-------------

* `Source code <http://github.com/Yelp/mrjob>`__
* `Documentation <https://mrjob.readthedocs.io/en/latest/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_

Reference
---------

* `Hadoop Streaming <http://hadoop.apache.org/docs/stable1/streaming.html>`_
* `Elastic MapReduce <http://aws.amazon.com/documentation/elasticmapreduce/>`_
* `Google Cloud Dataproc <https://cloud.google.com/dataproc/overview>`_

More Information
----------------

* `PyCon 2011 mrjob overview <http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/>`_
* `Introduction to Recommendations and MapReduce with mrjob <http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html>`_
  (`source code <https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob>`__)
* `Social Graph Analysis Using Elastic MapReduce and PyPy <http://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy>`_

Thanks to `Greg Killion <mailto:greg@blind-works.net>`_
(`ROMEO ECHO_DELTA <http://www.romeoechodelta.net/>`_) for the logo.