STAT 19000: Project 14 — Spring 2022
Motivation: We covered a lot this year! When dealing with data-driven projects, it is useful to explore the data and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. In this project we are going to practice using some of the skills you’ve learned, and review topics and languages in a generic way.
Context: We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We may even ask a question that asks you to use a tool you haven’t used before — but don’t worry, if we do, we will provide you with extra guidance.
Scope: Python, R, bash, unix, computers
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/airbnb/**/reviews.csv.gz
- /depot/datamine/data/election/itcont2022.txt
- /depot/datamine/data/death_records/DeathRecords.csv
Questions
Question 1
Scan through the reviews.csv.gz files in /depot/datamine/data/airbnb/* and find the 10 most common reviewer_name values.
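One way to approach this kind of tally is sketched below. It is only an illustration under stated assumptions: it builds two tiny stand-in files with made-up names in a temporary directory (not the real Airbnb data), the helper name count_reviewer_names is hypothetical, and it uses the standard library csv and gzip modules, although other tools would work just as well.

```python
import csv
import gzip
import tempfile
from collections import Counter
from pathlib import Path


def count_reviewer_names(paths):
    """Tally reviewer_name values across several gzipped CSV files."""
    totals = Counter()
    for path in paths:
        # Open in text mode; newline="" lets the csv module handle
        # quoted fields that contain line breaks.
        with gzip.open(path, "rt", newline="") as f:
            for row in csv.DictReader(f):
                name = row.get("reviewer_name")
                if name:  # skip missing or empty names
                    totals[name] += 1
    return totals


# Build two tiny stand-in files (made-up data, hypothetical folder names).
tmp = Path(tempfile.mkdtemp())
samples = {
    "city_a": "listing_id,reviewer_name\n1,Ana\n2,Ben\n3,Ana\n",
    "city_b": "listing_id,reviewer_name\n4,Ana\n5,Cy\n6,Ben\n",
}
for city, text in samples.items():
    (tmp / city).mkdir()
    with gzip.open(tmp / city / "reviews.csv.gz", "wt", newline="") as f:
        f.write(text)

# "**" matches reviews.csv.gz at any depth below tmp.
files = sorted(tmp.glob("**/reviews.csv.gz"))
top_names = count_reviewer_names(files).most_common(10)
print(top_names)
```

Reading one file at a time and updating a single Counter keeps memory use small, which matters when there are many large files.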
In particular, check out the example(s) in the basic use section.
You can read |
The following is an example of one way you could sum the values of a dictionary.
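A minimal sketch of one such approach, using a made-up dictionary of counts (the names and numbers are illustrative only):

```python
# Hypothetical dictionary mapping a name to how often it was seen.
counts = {"Ana": 3, "Ben": 2, "Cy": 1}

# .values() yields only the counts, and sum() adds them up.
total = sum(counts.values())
print(total)
```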
Test your code on a few of the |
- Code used to solve this problem.
- Output from running the code.
Question 2
After completing question 1, it is likely you have a solid understanding of how the data is organized. Add some logic to your code from question 1 to instead print the 5 most common names for each country.
If your $HOME country (haha) is in the list, do the names sound about right? What kind of bias does this data likely show?
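The per-country tally could be sketched as follows. This is an illustration, not the intended solution: it assumes the country is the first directory below the airbnb folder (verify this against the real layout), and the paths, the country_of helper, and the names are all made up.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath

ROOT = PurePosixPath("/depot/datamine/data/airbnb")


def country_of(path):
    # ASSUMPTION: the first directory below ROOT names the country.
    return PurePosixPath(path).relative_to(ROOT).parts[0]


# Made-up (file path, reviewer_name) pairs standing in for real rows.
rows = [
    ("/depot/datamine/data/airbnb/country-a/city-1/reviews.csv.gz", "Ana"),
    ("/depot/datamine/data/airbnb/country-a/city-1/reviews.csv.gz", "Ben"),
    ("/depot/datamine/data/airbnb/country-a/city-1/reviews.csv.gz", "Ana"),
    ("/depot/datamine/data/airbnb/country-a/city-2/reviews.csv.gz", "Ana"),
    ("/depot/datamine/data/airbnb/country-a/city-2/reviews.csv.gz", "Ben"),
    ("/depot/datamine/data/airbnb/country-a/city-2/reviews.csv.gz", "Cy"),
    ("/depot/datamine/data/airbnb/country-b/city-3/reviews.csv.gz", "Dee"),
    ("/depot/datamine/data/airbnb/country-b/city-3/reviews.csv.gz", "Dee"),
    ("/depot/datamine/data/airbnb/country-b/city-3/reviews.csv.gz", "Ben"),
]

# One Counter per country, created on first use.
by_country = defaultdict(Counter)
for path, name in rows:
    by_country[country_of(path)][name] += 1

top_by_country = {c: counter.most_common(5) for c, counter in by_country.items()}
for country, top in top_by_country.items():
    print(country, top)
```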
- Code used to solve this problem.
- Output from running the code.
Question 3
Check out the newest set of election data, /depot/datamine/data/election/itcont2022.txt. Let’s say we are interested in all entries (rows) that contain the word "purdue" (of course, this may include entries that don’t relate to Purdue University, but we are okay with that error).
This is around 5 GB of data, and only a small fraction of it has relevant information. In pandas, there is not an ergonomic way to check if a row of data has a string in it. This is where knowing how to use multiple tools will come in handy!
There is a tool called grep that can very quickly search large text files for certain text. We will learn more about grep (and other useful command line utilities) in STAT 29000. With that being said, why not figure out how to use grep to create a subset of data, already filtered, to read into pandas? It isn’t too bad!
Use grep to create a subset of data called my_election_data.txt. my_election_data.txt should contain only the rows that have the word "purdue" in them. my_election_data.txt should live in your $HOME directory: /home/purduealias/my_election_data.txt.
- Use grep to find only rows with the word "purdue" in them (case insensitive). Use redirection to save the output to $HOME/my_election_data.txt.

You can use the -i flag to make your grep search case insensitive; this means that rows with "Purdue" or "purdue" or "PuRdUe" would be found.

You can run grep from within Jupyter Notebooks using the %%bash magic. For example, the following would find the word "apple" in a dataset and create a new file called "my_new_file.csv" in my $HOME directory.

    %%bash
    grep "apple" /depot/datamine/data/yelp/data/json/yelp_academic_dataset_review.json > $HOME/my_new_file.csv

In order to insert the header line into your newly created file, you can run the following sed command directly after your grep command.

    sed -i '1 i\CMTE_ID|AMNDT_IND|RPT_TP|TRANSACTION_PGI|IMAGE_NUM|TRANSACTION_TP|ENTITY_TP|NAME|CITY|STATE|ZIP_CODE|EMPLOYER|OCCUPATION|TRANSACTION_DT|TRANSACTION_AMT|OTHER_ID|TRAN_ID|FILE_NUM|MEMO_CD|MEMO_TEXT|SUB_ID' $HOME/my_election_data.txt

- Use pandas to read in your newly created, much smaller dataset, $HOME/my_election_data.txt.
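To make the filter-then-read idea concrete, here is a small Python stand-in, offered as a sketch rather than a replacement for grep. The rows are made up, the header is shortened to four of the real column names for brevity, and the case-insensitive line filter only mimics what the -i flag does. The step that matters for the real file is passing sep="|" to pandas, since the data is pipe-delimited.

```python
import io

import pandas as pd

# Made-up, pipe-delimited rows with no header line (illustrative only).
raw = (
    "C001|DOE, JANE|PURDUE UNIVERSITY|250\n"
    "C002|ROE, RICK|ACME CORP|100\n"
    "C003|POE, PAT|Purdue Research Foundation|75\n"
)
# Shortened header for this sketch; the real file has many more columns.
header = "CMTE_ID|NAME|EMPLOYER|TRANSACTION_AMT\n"

# Python stand-in for a case-insensitive grep: keep lines containing "purdue".
matches = [
    line for line in raw.splitlines(keepends=True) if "purdue" in line.lower()
]

# Prepend the header (the role of the sed step), then parse with "|" as the separator.
subset = pd.read_csv(io.StringIO(header + "".join(matches)), sep="|")
print(subset)
```

For the real 5 GB file, grep is still the better choice for the filtering step, because it avoids loading the data into Python at all.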
Finally, print the EMPLOYER, NAME, OCCUPATION, and TRANSACTION_AMT for the top 10 donations (by size).
You may notice that each row represents a single donation. Group the data by the NAME column to get the total amount of donation per individual. What is the NAME of the top donor?
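Both steps can be sketched on a toy DataFrame. The rows below are invented for illustration; only the four column names come from the dataset header. This shows one possible approach (nlargest for the biggest single donations, groupby with sum for per-person totals), not the only one.

```python
import pandas as pd

# Made-up donations; each row is a single donation.
donations = pd.DataFrame(
    {
        "NAME": ["DOE, JANE", "ROE, RICK", "DOE, JANE", "POE, PAT"],
        "EMPLOYER": ["PURDUE UNIVERSITY", "ACME CORP", "PURDUE UNIVERSITY", "SELF"],
        "OCCUPATION": ["PROFESSOR", "ENGINEER", "PROFESSOR", "FARMER"],
        "TRANSACTION_AMT": [250, 400, 300, 75],
    }
)

# Largest individual donations (up to 10), showing the requested columns.
cols = ["EMPLOYER", "NAME", "OCCUPATION", "TRANSACTION_AMT"]
top_donations = donations.nlargest(10, "TRANSACTION_AMT")[cols]
print(top_donations)

# Total donated per person, largest first.
totals = (
    donations.groupby("NAME")["TRANSACTION_AMT"].sum().sort_values(ascending=False)
)
top_donor = totals.index[0]
print(top_donor, totals.iloc[0])
```

Note that the largest single donation and the largest total can belong to different people, as they do in this toy data.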
- Code used to solve this problem.
- Output from running the code.
Question 4
What is the average age at death for individuals who were married, single, divorced, widowed, or of unknown marital status?
Further split the data by Sex. Do the same patterns hold? Dig in a bit and notice that how we look at the data can make a very big difference!
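The grouping could be sketched as follows on a toy table. The values are invented, and apart from Sex, the column names (MaritalStatus, Age) and the single-letter codes are assumptions to check against the actual DeathRecords.csv header. It is also worth checking how ages and missing values are encoded before averaging.

```python
import pandas as pd

# Made-up records; column names other than Sex are assumptions.
deaths = pd.DataFrame(
    {
        "MaritalStatus": ["M", "M", "S", "S", "W", "W"],
        "Sex": ["F", "M", "F", "M", "F", "M"],
        "Age": [80, 70, 30, 40, 90, 84],
    }
)

# Average age at death per marital status.
by_status = deaths.groupby("MaritalStatus")["Age"].mean()
print(by_status)

# Same average, further split by Sex; unstack puts each sex in its own column.
by_status_sex = deaths.groupby(["MaritalStatus", "Sex"])["Age"].mean().unstack("Sex")
print(by_status_sex)
```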
- Code used to solve this problem.
- Output from running the code.
Question 5
It has been a fun year. We hope that you learned something new!
- Write down 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 19000).
- Write down 3 (or more) of your favorite projects and/or topics you wish you were able to learn more about.
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.