- Assignment title: APAN 5400 - Assignment 9 Apache Spark Distributed Application
- Course: Columbia University APAN 5400 Managing Data
- Time to complete: 2 days
Develop an Apache Spark application to the provided specifications, using the Crunchbase Open Data Map organizations dataset with PySpark in Google Colab.
Details
Use the Week11_ClassExercise.ipynb (this file was sent to you in an announcement) as a reference:
- Create a new notebook in Google Colab
- Upload the crunchbase_odm_orgs.csv file (this file was sent to you in an announcement) to the “Files” section of your Colab notebook (this may take a few minutes)
- Read the Crunchbase Orgs dataset into a Spark DataFrame
Implement PySpark code using DataFrames, RDDs or Spark UDF functions:
- Find all entities whose names start with the letter “F” (e.g. Facebook):
- print the count and show() the resulting Spark DataFrame
- Find all entities located in New York City:
- print the count and show() the resulting Spark DataFrame
- Add a “Blog” column to the DataFrame with the row entries set to 1 if the “domain” field contains “blogspot.com”, and 0 otherwise.
- show() only the records with the “Blog” field marked as 1
- Find all entities with names that are palindromes (the name reads the same forward and backward, e.g. madam):
- print the count and show() the resulting Spark DataFrame
Assessment
Please see the attached rubric for detailed assessment criteria.
Submission
To complete your submission,
- Please submit a PDF file or Word Document.
- Click the blue Submit Assignment button at the top of this page.
- Click the Choose File button, and locate your submission.
- Feel free to include a comment with your submission.
- Finally, click the blue Submit Assignment button.