Current Project Objectives

This week was primarily centered on clarifying the focus of our project, as well as on the retrieval of relevant sources of data. During the first part of the week, our group searched TidyCensus, the USDA and various other internet sources for data that could potentially be useful for whichever type of tool that we and our client decide on.

It could be said that we have started moving into the second phase of our project this week: data collection.

On Thursday, we met with our client and some of his associates at the the research park to discuss the direction of the project. We had them answer various questions on Mentimeter and showed them the list of datasets under consideration. They also gave us numerous suggestions about sources they were familiar with and told us that they would help grant us access to those of which were not immediately available to us.

Here are some of the Mentimeter results from our meeting:

As we are wrapping up this week, we have now shifted our focus to compiling our selected data to our repository and filtering through sources that may not be as useful as others. We are continuing to think of ways to optimize this process, as well as remaining open to other sources of data or information that we have yet to discover. Here are some of the resources under consideration:

Works in Progress

Aaron

This week has been less coding-centric than the previous week, as many of our efforts have been centered on uncovering useful sources of data. However, I have tried to work through ways to accelerate the time-consuming process of labeling and pushing CSVs of tables of ACS data.

Here is an example of such a table: “Mode of Transport to Work by County 2021”

I have also written a function that I’ve been using to speed the process up. It’s a data labeler function that takes “table”, “year” and “geography as arguments and outputs a labelled ACS5 data table by joining labels from the”load_variables()” function in TidyCensus by the common variable code on both dataframes.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(tidycensus)
library(stringr)
library(readr)



data_labeler_Iowa <- function(table, year, geography){
  df <- tidycensus::get_acs(geography = geography,
                table = table,
                state = "IA",
                year = year,
                survey = "acs5")
  
  
  vars <- tidycensus::load_variables(year, "acs5")
  
  vars <- vars %>%
    dplyr::filter(stringr::str_detect(name, table))
  
  vars <- vars %>% 
    dplyr::rename(variable = name)
  
  vars <- vars %>%
    dplyr::select(variable, label)
  
  df2 <- merge(df,vars)
  
  df2 %>%
    dplyr::arrange(GEOID)
}

Median_Household_Income_County_2021 <- data_labeler_Iowa("B19013", 2021, "county")

Getting data from the 2017-2021 5-year ACS

head(Median_Household_Income_County_2021)

    variable GEOID                   NAME estimate  moe
1 B19013_001 19001     Adair County, Iowa    57944 4047
2 B19013_001 19003     Adams County, Iowa    57981 7853
3 B19013_001 19005 Allamakee County, Iowa    59461 2644
4 B19013_001 19007 Appanoose County, Iowa    46900 5645
5 B19013_001 19009   Audubon County, Iowa    54643 5861
6 B19013_001 19011    Benton County, Iowa    72334 3626
                                                                                         label
1 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)
2 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)
3 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)
4 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)
5 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)
6 Estimate!!Median household income in the past 12 months (in 2021 inflation-adjusted dollars)

Building on this, I tried to iterate through all of these arguments by using nested for loops and writing csvs automatically. However, for a reason I don’t entirely understand, I haven’t gotten it to work the way I should. It’s possible that this has to do with the limitations of the API call and its speed within the loop.

yearlist <- c(2009, 2012, 2016, 2021)
geo_list <- c("county", "tract")
Income_table_list <- c("B19013", "B19113", "B19202", "B19051", "B19055")
acs_table <- NULL


for(table in Income_table_list)

  for(year in yearlist)
  
    for(geography in geo_list)
    
      acs_table <- tidycensus::get_acs(geography = geography,
                                       table = table,
                                       state = "IA",
                                       year = year,
                                       survey = "acs5")
      
      write_csv(acs_table, file = sprintf("%s_%s_%s", table, year, geography))

However, even if this didn’t completely work, I learned a lot of useful information, such as the “sprintf” function that provides placeholders for strings that are named from iterating variables in loop, as well as how to call only necessary elements of a library into a function to prevent using too much memory/slowing run time.

Alex

Data Exploration

I spent a lot of time this week doing data exploration. I found the USDA food atlas data set, as well as the Economic Census data. I also explored the TinyUSDA package.

Client Meeting

This week I met with our clients, where we clarified the scope of the project, discussed the goals moving forwards, and shared our progress. I shared about the USDA food atlas and Economic Census data sets

SQL learning

This week I spent time learning PostgreSQL on DataCamp. I also did several other courses related to data visualizations and fundamental statistical skills.

Srika

Training:

Data Camp courses completed:

Introduction to statistics in R

Data visualization with R (in progress)

Created summaries of the client reports:

Understanding the market trends and margins for different departments in a rural grocery store.

Collected some ACS Data

Browsed for other possible useful sources to consider

DSPG Questions

What is the best way to scrape text off of a webpage with R?
What are some general rules/best practices for writing efficient and reliable functions?
What is differential privacy?