🍐 我们总结了R语言代写中，美国R语言代写的经典案例，如果你有任何作业代写的需求，可以随时联系我们，CoursePear™ From @2009。

title: “Lab 1 Assignment – Review/Introduction to R and Data Structures”
subtitle: “SOC 325: Quantified-Self”
date: “`r Sys.Date()`
output:
html_document: toc: true

Write all code in the chunks provided. Complete this `.Rmd` file and knit it into an `.html`. You must upload both files for credit.

Remember to unzip to a real directory before running everything!

### 1. Basics of R and R markdown

a) 27(38-17)
b) ln(14^7)
c) sqrt(436/12)

#### 1.3. Run the below code to create a vector. Observe what e contains and use `?seq` to see help of function `seq()`.

``````e <- seq(0, 10, length=5)
e``````

Create the following vectors:
b = (87, 86, 85, …, 56)

What is the 19th, 20th, and 21st elements of b?

#### 1.4. Compute the following statistics of b:

a) sum
b) median
c) standard deviation

#### 1.5. Following the example given in lab1, mix in-line R calculations with text and make reference to vector b. You must use in-line R calculations at least once (e.g. functions like mean(), sd(), max()) and may not hard-code any numbers referenced in your text. An example is given below:

The average of `b` is `mean(b)`.

### 2. Research Question (You don’t need code for this question)

For this problem you’ll answer some questions to help explore your interests in data science. These are questions that you’re interested in. They don’t have to be things that you know the answer to and still less new areas of study.

However, problem 3 asks you to come up with a ‘big data’ dataset that you think you might use to answer your question. If you’re new to R or not sure about what to do, I encourage you to use the Airbnb data that we’ll be using in class. In that case, make sure that your answers to problem 2 relate to the airbnb data.

#### 2.1: What are some areas of interest for you within sociology, big data, and computational social science?

If you’re using the airbnb set, explain how it connects to your interests.

### Problem 4: Piping Hot Variables

This problem uses `dplyr` verbs to answer questions about an Airbnb data set.

#### 4.1: Get the data

Go to Inside Airbnb and download the “Detailed Listings” data for Seattle, `listings.csv.gz`. This file has many more variables than the “Summary” file we’ve been using in class. Put it in a `data/` subfolder in your `hw-02` project folder.

[This is a compressed (gzipped) file, but R should be able to handle it as-is. If you run into trouble, try unzipping the file before reading it into R.]

#### 4.2: Set up your R environment

b. Read the detailed Airbnb data into R

#### 4.3: Use the data to answer a question

For how many units does the host live in a different neighborhood from the listing? For how many units does the host live in the same neighborhood as the listing?

Try to figure out which variables to use from their names, and think about which verbs you’ve learned about might work to answer this question. See the hints at the end if you need help.

Building on that work, what is the average number of listings for hosts that live in the same neighborhood as their listing? What’s the average for hosts who live in different neighborhoods from their listing?

The `mean` function will take the average of a variable, but you might need to look up how to use it. See the hints for more suggestions if you get stuck.

#### 4.5: Reflect and interpret

Reflect on your answer to 1.4. What might cause the results you got? How does that connect to the idea that Airbnb might be changing neighborhoods?

### 5. Prepare and Visualize data

#### 5.1. Set up your environment

Reading the Airbnb data: There’s another new data set in the `data/` folder. This one has almost 10,000 cases and the census data by zipcode. These data are from New York City, not Seattle!

We’ve given you absolute populations and proportions for the racial composition of the zipcode for each listing. We’ve also made a variable called ‘modal_race’ which is the race with the largest proportion in that neighborhood.

These variables are all in the last columns of the data set—you can try selecting them and using `summary()` to get a sense for what they contain.

#### 5.2: Turn `price` into a number

`price` includes dollar signs, which means that R interprets it as a character. We want it to be a numeric variable instead. Turn `price` into a numeric variable in the chunk below.

There are a few ways to do this using `tidyverse` functions. See the hints below for some suggestions.

#### 5.3: Make a scatterplot

Use a scatter plot to compare how unit prices change with the proportion of a particular race.

Bonus: try grouping by zipcode (in any fashion) for this plot

#### 5.4: Make a boxplot

Use the `modal_race` variable to plot a boxplot comparing race and price. You may have to look up how to make a boxplot in `ggplot2`—what geom do you need?

Bonus: try showing how this comparison differs by neighborhood group.

Interpret your answer to 5.4. Check the hints if you need help.

#### Bonus: how did we make the data?

There’s another file in the data folder, census.csv. Read it into R and have a look at it.

Download the full listings for New York City from Inside Airbnb, and see if you can join the Census data to it by zipcode using `left_join`. You’ll have to filter out some weird values for zipcode before you can merge.

#### 6.3. Using the functions we’ve worked with in class (select, filter, arrange, mutate), plus any others you’d like to use, clean and transform your data set to make it ready for further exploration.

You must:

a. Create a new dataset that only includes the variables you’re interested in
b. Output a version of that dataset that only includes certain values of observations, hopefully ones you’re interested in.
c. Order your data by the values of one variable you’re interested in.
d. Create a modified version of one of your variables (many of you will need to do this, but even if you don’t, I want to see that you can)
e. Look up and try out one new verb for data transformation. The RStudio data transformation cheat sheet is a fantastic place to start: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

For e., we’d recommend using `group_by` + `summarize`. You can group your data by one variable, and then see the mean (or similar) of another variable within each of those groups.

Use as many code blocks as you need for a-e

### Hints

4.3 Try using these steps:

• Step 1: identify the variables you need
• Listing neighborhood: `neighbourhood`
• Host’s neighborhood: `host_neighbourhood`
• Step 2: Filter the data to only include the rows where those variables are not equal. Look back to Module 2 (or look online) if you need a reminder about how to write “equal”, “not equal”, and so on in R.
• Step 3: How many rows are left in the filtered data?

Extra food for thought: how do “NA” (missing) values get handled here? Do you think that makes sense? Should you do something else with them, maybe using `is.na`?

4.4 The variable for number of listings is `host_listings_count`. You might want to make a new variable indicating if a host is a local host (your answer to 1.3 will help here!). There are many ways to use `mean` on a subset of data, but the best approach is one we introduce in Module 5: `group_by` + `summarize`. Try it out now if you can! For this problem, don’t worry about NAs.

5.2

Use `mutate` for this. You can replace the original `price` variable, or name it something else. There are a couple things you can use on `price` inside the mutate:

• `parse_number`, a function in the `readr` package, does a good job of converting currency to numbers on its own.
• `str_extract` with `pattern = "\\d+"`, then `as.numeric`, will extract numbers from a string, then convert the new (sub)string to a number.
• `str_remove_all`, with `pattern = "[\\\$|,]"`, then `as.numeric`, will remove all dollar signs and commas.

5.5

Check out these resources if you’re not sure about interpreting box plots:

6.3

a. use select()
b. use filter()
c. use arrange()
d. use mutate()
e. use group_by(var1) %>% summarise(mean = mean(var2))

Write all code in the chunks provided!

Remember to unzip to a real directory before running everything!

Problems should be roughly analogous to what we’ve done in class, with a few extensions. There are hints at the bottom of this document if you get stuck. If you still can’t figure it out, go to google/stack exchange/ask a friend. Finally, email your TA or come to office hours :).

#### 7.1

Go to Google Trends and search for “covid-19 vaccine”. Look at variations by time and by region in US. What do you observe?

### Problem 7: Join data frames

In this problem we will use data in the `nycflightdata13` package to perform joining of data frames.

It includes five dataframes, some of which contain missing data (`NA`).

• `flights`: flights leaving JFK, LGA or EWR in 2013
• `airlines`: airline abbreviations
• `airports`: airport metadata
• `planes`: airplane metadata
• `weather`: hourly weather data from JFK, LGA and EWR

Note these are separate data frames, each needing to be loaded separately using `data()`.

#### 7.1. Set up your environment:

a. Install and load the `nycflights13` package. Load the `tidyverse` package.
b. Load data sets `flights`, `planes`, `airlines`

#### 7.2 Find data frames

We’ll be looking at who manufactures the planes that flew to Seattle. Which are the two data frames we need to join?

#### 7.3. Find common keys

Take a look at variables contained the two data frames. Which variable(s) should be used as the key to join?

#### 7.4. Join the two data frames

For flights with a destination of Seattle, who are the largest manufacturers? Give top five of the manufacturers.
(Check hints if you have troubles)

#### 7.6. Use the data to anwer the below questions

We’d like to know which airlines had the most flights to Seattle from NYC. Which are the two data frames we need to join, and on which key variable(s)?

#### 7.7.

Join the two data frames in 7.6 and list the top five airlines.

#### Problem 8: Your research question

Think about the research question you have in mind. Plot is a great way to understand patterns, key relationships and uncertainties in a data set. Here we’ll ask you to plan about plotting your variables of interest for your research question. Try to think about 3 plots below:

For each of the 3 plots, provide:

A. The purpose of the plot: what do you want people to understand when they see this?

B. The type of plot: what geom functions will you use to present the plot? Why are those the best choices?

C. Limitations/biases: What is missing from this presentation? Could someone get the wrong idea? What can you do to help limit the negative possibilities here?

A.
B.
C.

A.
B.
C.

#### Plot idea 3

A.
B.
C.

Hint

7.5

use `left_join()` by “tailnum” to join the two data frames, then count() observations by manufacturer, and then use arrange() with descending order.

CoursePear™提供各类学术服务，Essay代写Assignment代写Exam / Quiz助攻Dissertation / Thesis代写Problem Set代做等。