R语言代写Lab 1 Review to R and Data Structures

🍐 我们总结了R语言代写中，美国R语言代写的经典案例，如果你有任何作业代写的需求，可以随时联系我们，CoursePear™ From @2009。

title: “Lab 1 Assignment – Review/Introduction to R and Data Structures”
subtitle: “SOC 325: Quantified-Self”
author: “PUT YOUR NAME HERE”
date: “r Sys.Date()“
output:
html_document: toc: true

Write all code in the chunks provided. Complete this .Rmd file and knit it into an .html. You must upload both files for credit.

Remember to unzip to a real directory before running everything!

1. Basics of R and R markdown

1.1. Create a vector containing elements 10, 22, 27, 19, 20 and assign it with a name.

1.2. Use R as a calculator to compute the following values.

a) 27(38-17)
b) ln(14^7)
c) sqrt(436/12)

1.3. Run the below code to create a vector. Observe what e contains and use `?seq` to see help of function `seq()`.

e <- seq(0, 10, length=5)
e

Create the following vectors:
b = (87, 86, 85, …, 56)

What is the 19th, 20th, and 21st elements of b?

1.4. Compute the following statistics of b:

a) sum
b) median
c) standard deviation

1.5. Following the example given in lab1, mix in-line R calculations with text and make reference to vector b. You must use in-line R calculations at least once (e.g. functions like mean(), sd(), max()) and may not hard-code any numbers referenced in your text. An example is given below:

The average of b is mean(b).

2. Research Question (You don’t need code for this question)

For this problem you’ll answer some questions to help explore your interests in data science. These are questions that you’re interested in. They don’t have to be things that you know the answer to and still less new areas of study.

However, problem 3 asks you to come up with a ‘big data’ dataset that you think you might use to answer your question. If you’re new to R or not sure about what to do, I encourage you to use the Airbnb data that we’ll be using in class. In that case, make sure that your answers to problem 2 relate to the airbnb data.

2.2: Provide a link to a dataset which you think intersects with one of your interests. Explain the connection. You can find datasets by doing a google search or by looking here:http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html or here: https://www.kaggle.com/

If you’re using the airbnb set, explain how it connects to your interests.

3. Import data and identify variables

3.1. Import your data into R and output the column names.

3.2. Use View(), head() or tail() to check your data. What variables does it contain? How many rows are in your data? What is the unit of analysis in your data?

3.3. Discuss how might some variables serve your research interest as discussed in problem 2 above.

Problem 4: Piping Hot Variables

This problem uses dplyr verbs to answer questions about an Airbnb data set.

4.1: Get the data

Go to Inside Airbnb and download the “Detailed Listings” data for Seattle, listings.csv.gz. This file has many more variables than the “Summary” file we’ve been using in class. Put it in a data/ subfolder in your hw-02 project folder.

[This is a compressed (gzipped) file, but R should be able to handle it as-is. If you run into trouble, try unzipping the file before reading it into R.]

4.2: Set up your R environment

a. Load the tidyverse
b. Read the detailed Airbnb data into R

4.3: Use the data to answer a question

For how many units does the host live in a different neighborhood from the listing? For how many units does the host live in the same neighborhood as the listing?

Try to figure out which variables to use from their names, and think about which verbs you’ve learned about might work to answer this question. See the hints at the end if you need help.

4.4: Build on your answer

Building on that work, what is the average number of listings for hosts that live in the same neighborhood as their listing? What’s the average for hosts who live in different neighborhoods from their listing?

The mean function will take the average of a variable, but you might need to look up how to use it. See the hints for more suggestions if you get stuck.

4.5: Reflect and interpret

Reflect on your answer to 1.4. What might cause the results you got? How does that connect to the idea that Airbnb might be changing neighborhoods?

Your answer should be at least a few sentences here

5. Prepare and Visualize data

5.1. Set up your environment

Set up your environment by:

Reading the Airbnb data: There’s another new data set in the data/ folder. This one has almost 10,000 cases and the census data by zipcode. These data are from New York City, not Seattle!

We’ve given you absolute populations and proportions for the racial composition of the zipcode for each listing. We’ve also made a variable called ‘modal_race’ which is the race with the largest proportion in that neighborhood.

These variables are all in the last columns of the data set—you can try selecting them and using summary() to get a sense for what they contain.

5.2: Turn `price` into a number

price includes dollar signs, which means that R interprets it as a character. We want it to be a numeric variable instead. Turn price into a numeric variable in the chunk below.

There are a few ways to do this using tidyverse functions. See the hints below for some suggestions.

5.3: Make a scatterplot

Use a scatter plot to compare how unit prices change with the proportion of a particular race.

Bonus: try grouping by zipcode (in any fashion) for this plot

5.4: Make a boxplot

Use the modal_race variable to plot a boxplot comparing race and price. You may have to look up how to make a boxplot in ggplot2—what geom do you need?

Bonus: try showing how this comparison differs by neighborhood group.

5.5: Interpret your answer

Interpret your answer to 5.4. Check the hints if you need help.

Your answer should be at least a few sentences here

Bonus: how did we make the data?

There’s another file in the data folder, census.csv. Read it into R and have a look at it.

Download the full listings for New York City from Inside Airbnb, and see if you can join the Census data to it by zipcode using left_join. You’ll have to filter out some weird values for zipcode before you can merge.

6. Your own data

6.1. Looking at the datasets used so far, think about a research question you’d like to investigate (try search about existing studies around your question). What variables do you plan to use to answer your question?

6.2. What is one way that you have to modify or examine your data to begin to answer your question?

6.3. Using the functions we’ve worked with in class (select, filter, arrange, mutate), plus any others you’d like to use, clean and transform your data set to make it ready for further exploration.

You must:

a. Create a new dataset that only includes the variables you’re interested in
b. Output a version of that dataset that only includes certain values of observations, hopefully ones you’re interested in.
c. Order your data by the values of one variable you’re interested in.
d. Create a modified version of one of your variables (many of you will need to do this, but even if you don’t, I want to see that you can)
e. Look up and try out one new verb for data transformation. The RStudio data transformation cheat sheet is a fantastic place to start: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For e., we’d recommend using group_by + summarize. You can group your data by one variable, and then see the mean (or similar) of another variable within each of those groups.

Use as many code blocks as you need for a-e

Hints

4.3 Try using these steps:

Step 1: identify the variables you need
Listing neighborhood: neighbourhood
Host’s neighborhood: host_neighbourhood
Step 2: Filter the data to only include the rows where those variables are not equal. Look back to Module 2 (or look online) if you need a reminder about how to write “equal”, “not equal”, and so on in R.
Step 3: How many rows are left in the filtered data?

Extra food for thought: how do “NA” (missing) values get handled here? Do you think that makes sense? Should you do something else with them, maybe using is.na?

4.4 The variable for number of listings is host_listings_count. You might want to make a new variable indicating if a host is a local host (your answer to 1.3 will help here!). There are many ways to use mean on a subset of data, but the best approach is one we introduce in Module 5: group_by + summarize. Try it out now if you can! For this problem, don’t worry about NAs.

5.2

Use mutate for this. You can replace the original price variable, or name it something else. There are a couple things you can use on price inside the mutate:

parse_number, a function in the readr package, does a good job of converting currency to numbers on its own.
str_extract with pattern = "\\d+", then as.numeric, will extract numbers from a string, then convert the new (sub)string to a number.
str_remove_all, with pattern = "[\\$|,]", then as.numeric, will remove all dollar signs and commas.

5.5

Check out these resources if you’re not sure about interpreting box plots:

6.3

a. use select()
b. use filter()
c. use arrange()
d. use mutate()
e. use group_by(var1) %>% summarise(mean = mean(var2))

Write all code in the chunks provided!

Remember to unzip to a real directory before running everything!

Problems should be roughly analogous to what we’ve done in class, with a few extensions. There are hints at the bottom of this document if you get stuck. If you still can’t figure it out, go to google/stack exchange/ask a friend. Finally, email your TA or come to office hours :).

Problem 7: Google Trends

7.1

Go to Google Trends and search for “covid-19 vaccine”. Look at variations by time and by region in US. What do you observe?

Problem 7: Join data frames

In this problem we will use data in the nycflightdata13 package to perform joining of data frames.

It includes five dataframes, some of which contain missing data (NA).

flights: flights leaving JFK, LGA or EWR in 2013
airlines: airline abbreviations
airports: airport metadata
planes: airplane metadata
weather: hourly weather data from JFK, LGA and EWR

Note these are separate data frames, each needing to be loaded separately using data().

7.1. Set up your environment:

a. Install and load the nycflights13 package. Load the tidyverse package.
b. Load data sets flights, planes, airlines

7.2 Find data frames

We’ll be looking at who manufactures the planes that flew to Seattle. Which are the two data frames we need to join?

7.3. Find common keys

Take a look at variables contained the two data frames. Which variable(s) should be used as the key to join?

7.4. Join the two data frames

7.5. Build on your answer

For flights with a destination of Seattle, who are the largest manufacturers? Give top five of the manufacturers.
(Check hints if you have troubles)

7.6. Use the data to anwer the below questions

We’d like to know which airlines had the most flights to Seattle from NYC. Which are the two data frames we need to join, and on which key variable(s)?

7.7.

Join the two data frames in 7.6 and list the top five airlines.

Problem 8: Your research question

Think about the research question you have in mind. Plot is a great way to understand patterns, key relationships and uncertainties in a data set. Here we’ll ask you to plan about plotting your variables of interest for your research question. Try to think about 3 plots below:

For each of the 3 plots, provide:

A. The purpose of the plot: what do you want people to understand when they see this?

B. The type of plot: what geom functions will you use to present the plot? Why are those the best choices?

C. Limitations/biases: What is missing from this presentation? Could someone get the wrong idea? What can you do to help limit the negative possibilities here?

Plot idea 1

A.
B.
C.

Plot idea 2

A.
B.
C.

Plot idea 3

A.
B.
C.

Hint

7.5

use left_join() by “tailnum” to join the two data frames, then count() observations by manufacturer, and then use arrange() with descending order.

CoursePear™是一家服务全球留学生的专业代写。
—-我们专注提供高质靠谱的美国、加拿大、英国、澳洲、新西兰代写服务。
—-我们专注提供Essay、统计、金融、CS、经济、数学等覆盖100+专业的作业代写服务。

CoursePear™提供各类学术服务，Essay代写，Assignment代写，Exam / Quiz助攻，Dissertation / Thesis代写，Problem Set代做等。

1. Basics of R and R markdown

1.1. Create a vector containing elements 10, 22, 27, 19, 20 and assign it with a name.

1.2. Use R as a calculator to compute the following values.

1.3. Run the below code to create a vector. Observe what e contains and use ?seq to see help of function seq().

1.4. Compute the following statistics of b:

1.5. Following the example given in lab1, mix in-line R calculations with text and make reference to vector b. You must use in-line R calculations at least once (e.g. functions like mean(), sd(), max()) and may not hard-code any numbers referenced in your text. An example is given below:

2. Research Question (You don’t need code for this question)

2.1: What are some areas of interest for you within sociology, big data, and computational social science?

2.2: Provide a link to a dataset which you think intersects with one of your interests. Explain the connection. You can find datasets by doing a google search or by looking here:http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html or here: https://www.kaggle.com/

3. Import data and identify variables

3.1. Import your data into R and output the column names.

3.2. Use View(), head() or tail() to check your data. What variables does it contain? How many rows are in your data? What is the unit of analysis in your data?

3.3. Discuss how might some variables serve your research interest as discussed in problem 2 above.

Problem 4: Piping Hot Variables

4.1: Get the data

4.2: Set up your R environment

4.3: Use the data to answer a question

4.4: Build on your answer

4.5: Reflect and interpret

5. Prepare and Visualize data

5.1. Set up your environment

5.2: Turn price into a number

5.3: Make a scatterplot

5.4: Make a boxplot

5.5: Interpret your answer

Bonus: how did we make the data?

6. Your own data

6.1. Looking at the datasets used so far, think about a research question you’d like to investigate (try search about existing studies around your question). What variables do you plan to use to answer your question?

6.2. What is one way that you have to modify or examine your data to begin to answer your question?

6.3. Using the functions we’ve worked with in class (select, filter, arrange, mutate), plus any others you’d like to use, clean and transform your data set to make it ready for further exploration.

Hints

Problem 7: Google Trends

7.1

Problem 7: Join data frames

7.1. Set up your environment:

7.2 Find data frames

7.3. Find common keys

7.4. Join the two data frames

7.5. Build on your answer

7.6. Use the data to anwer the below questions

7.7.

Problem 8: Your research question

Plot idea 1

Plot idea 2

Plot idea 3

Stay In Touch

1.3. Run the below code to create a vector. Observe what e contains and use `?seq` to see help of function `seq()`.

5.2: Turn `price` into a number