Search code examples
r

Extracting data from a complicated .txt file using r


I am not a programmer, and I don't know much about code; I am a researcher. I have been using ChatGPT to help me with this, which has worked in the past, but no matter what I do, I cannot get it to work after many many hours of trying.

I have 14 variables I need to extract data from across multiple participants. It is organized by variable: there are various tables per participant for one of the variables, followed by the same set of tables for the next participant for the same variable, that repeats for each participant, then it moves on to the next variable, and continues in that pattern.

This data I need extracted and placed into a .csv file. I will explain later what I need the .csv to look like.

This is what the .txt file looks like for two of 14 variables, for the first two participants of each.

Participant Private ID = 11450318.000000000
Case Processing Summarya
|--------------------------------|-------------------------------------------|
|                                |Cases                                      |
|                                |-------------|---------------|-------------|
|                                |Valid        |Missing        |Total        |
|                                |-----|-------|-------|-------|-----|-------|
|                                |N    |Percent|N      |Percent|N    |Percent|
|--------------------------------|-----|-------|-------|-------|-----|-------|
|CC.FOR_Mean * CC.T1TimedAcc_Mean|15   |31.3%  |33     |68.8%  |48   |100.0% |
|----------------------------------------------------------------------------|
 a Participant Private ID = 11450318.000000000

CC.FOR_Mean * CC.T1TimedAcc_Mean Crosstabulationa
Count 
|----------------|-----------------------|-----|
|                |CC.T1TimedAcc_Mean     |Total|
|                |------------------|----|     |
|                |.00               |1.00|     |
|-----------|----|------------------|----|-----|
|CC.FOR_Mean|2.00|1                 |1   |2    |
|           |----|------------------|----|-----|
|           |3.00|1                 |2   |3    |
|           |----|------------------|----|-----|
|           |4.00|1                 |0   |1    |
|           |----|------------------|----|-----|
|           |5.00|1                 |2   |3    |
|           |----|------------------|----|-----|
|           |7.00|1                 |1   |2    |
|           |----|------------------|----|-----|
|           |8.00|1                 |2   |3    |
|           |----|------------------|----|-----|
|           |9.00|0                 |1   |1    |
|----------------|------------------|----|-----|
|Total           |6                 |9   |15   |
|----------------------------------------------|
 a Participant Private ID = 11450318.000000000

Symmetric Measuresa
|------------------------|-----|--------------------------|--------------|------------------------|
|                        |Value|Asymptotic Standard Errorb|Approximate Tc|Approximate Significance|
|------------------|-----|-----|--------------------------|--------------|------------------------|
|Ordinal by Ordinal|Gamma|.217 |.334                      |.644          |.520                    |
|------------------------|-----|--------------------------|--------------|------------------------|
|N of Valid Cases        |15   |                          |              |                        |
|-------------------------------------------------------------------------------------------------|
 a Participant Private ID = 11450318.000000000
 b Not assuming the null hypothesis.
 c Using the asymptotic standard error assuming the null hypothesis.


Participant Private ID = 11450328.000000000
Case Processing Summarya
|--------------------------------|-------------------------------------------|
|                                |Cases                                      |
|                                |-------------|---------------|-------------|
|                                |Valid        |Missing        |Total        |
|                                |-----|-------|-------|-------|-----|-------|
|                                |N    |Percent|N      |Percent|N    |Percent|
|--------------------------------|-----|-------|-------|-------|-----|-------|
|CC.FOR_Mean * CC.T1TimedAcc_Mean|16   |33.3%  |32     |66.7%  |48   |100.0% |
|----------------------------------------------------------------------------|
 a Participant Private ID = 11450328.000000000

CC.FOR_Mean * CC.T1TimedAcc_Mean Crosstabulationa
Count 
|----------------|-----------------------|-----|
|                |CC.T1TimedAcc_Mean     |Total|
|                |------------------|----|     |
|                |.00               |1.00|     |
|-----------|----|------------------|----|-----|
|CC.FOR_Mean|1.00|0                 |1   |1    |
|           |----|------------------|----|-----|
|           |2.00|3                 |1   |4    |
|           |----|------------------|----|-----|
|           |3.00|6                 |2   |8    |
|           |----|------------------|----|-----|
|           |4.00|0                 |1   |1    |
|           |----|------------------|----|-----|
|           |5.00|1                 |0   |1    |
|           |----|------------------|----|-----|
|           |6.00|0                 |1   |1    |
|----------------|------------------|----|-----|
|Total           |10                |6   |16   |
|----------------------------------------------|
 a Participant Private ID = 11450328.000000000

Symmetric Measuresa
|------------------------|-----|--------------------------|--------------|------------------------|
|                        |Value|Asymptotic Standard Errorb|Approximate Tc|Approximate Significance|
|------------------|-----|-----|--------------------------|--------------|------------------------|
|Ordinal by Ordinal|Gamma|.111 |.419                      |.264          |.792                    |
|------------------------|-----|--------------------------|--------------|------------------------|
|N of Valid Cases        |16   |                          |              |                        |
|-------------------------------------------------------------------------------------------------|
 a Participant Private ID = 11450328.000000000
 b Not assuming the null hypothesis.
 c Using the asymptotic standard error assuming the null hypothesis.

Participant Private ID = 11450318.000000000
Case Processing Summarya
|----------------------------------|-------------------------------------------|
|                                  |Cases                                      |
|                                  |-------------|---------------|-------------|
|                                  |Valid        |Missing        |Total        |
|                                  |-----|-------|-------|-------|-----|-------|
|                                  |N    |Percent|N      |Percent|N    |Percent|
|----------------------------------|-----|-------|-------|-------|-----|-------|
|CC.FOR_Mean * CC.T2UntimedAcc_Mean|15   |31.3%  |33     |68.8%  |48   |100.0% |
|------------------------------------------------------------------------------|
 a Participant Private ID = 11450318.000000000

CC.FOR_Mean * CC.T2UntimedAcc_Mean Crosstabulationa
Count 
|----------------|-------------------------|-----|
|                |CC.T2UntimedAcc_Mean     |Total|
|                |--------------------|----|     |
|                |.00                 |1.00|     |
|-----------|----|--------------------|----|-----|
|CC.FOR_Mean|2.00|1                   |1   |2    |
|           |----|--------------------|----|-----|
|           |3.00|0                   |3   |3    |
|           |----|--------------------|----|-----|
|           |4.00|1                   |0   |1    |
|           |----|--------------------|----|-----|
|           |5.00|2                   |1   |3    |
|           |----|--------------------|----|-----|
|           |7.00|1                   |1   |2    |
|           |----|--------------------|----|-----|
|           |8.00|1                   |2   |3    |
|           |----|--------------------|----|-----|
|           |9.00|0                   |1   |1    |
|----------------|--------------------|----|-----|
|Total           |6                   |9   |15   |
|------------------------------------------------|
 a Participant Private ID = 11450318.000000000

Symmetric Measuresa
|------------------------|-----|--------------------------|--------------|------------------------|
|                        |Value|Asymptotic Standard Errorb|Approximate Tc|Approximate Significance|
|------------------|-----|-----|--------------------------|--------------|------------------------|
|Ordinal by Ordinal|Gamma|.042 |.337                      |.124          |.901                    |
|------------------------|-----|--------------------------|--------------|------------------------|
|N of Valid Cases        |15   |                          |              |                        |
|-------------------------------------------------------------------------------------------------|
 a Participant Private ID = 11450318.000000000
 b Not assuming the null hypothesis.
 c Using the asymptotic standard error assuming the null hypothesis.


Participant Private ID = 11450328.000000000
Case Processing Summarya
|----------------------------------|-------------------------------------------|
|                                  |Cases                                      |
|                                  |-------------|---------------|-------------|
|                                  |Valid        |Missing        |Total        |
|                                  |-----|-------|-------|-------|-----|-------|
|                                  |N    |Percent|N      |Percent|N    |Percent|
|----------------------------------|-----|-------|-------|-------|-----|-------|
|CC.FOR_Mean * CC.T2UntimedAcc_Mean|16   |33.3%  |32     |66.7%  |48   |100.0% |
|------------------------------------------------------------------------------|
 a Participant Private ID = 11450328.000000000

CC.FOR_Mean * CC.T2UntimedAcc_Mean Crosstabulationa
Count 
|----------------|-------------------------|-----|
|                |CC.T2UntimedAcc_Mean     |Total|
|                |--------------------|----|     |
|                |.00                 |1.00|     |
|-----------|----|--------------------|----|-----|
|CC.FOR_Mean|1.00|1                   |0   |1    |
|           |----|--------------------|----|-----|
|           |2.00|1                   |3   |4    |
|           |----|--------------------|----|-----|
|           |3.00|3                   |5   |8    |
|           |----|--------------------|----|-----|
|           |4.00|0                   |1   |1    |
|           |----|--------------------|----|-----|
|           |5.00|0                   |1   |1    |
|           |----|--------------------|----|-----|
|           |6.00|1                   |0   |1    |
|----------------|--------------------|----|-----|
|Total           |6                   |10  |16   |
|------------------------------------------------|
 a Participant Private ID = 11450328.000000000

Symmetric Measuresa
|------------------------|-----|--------------------------|--------------|------------------------|
|                        |Value|Asymptotic Standard Errorb|Approximate Tc|Approximate Significance|
|------------------|-----|-----|--------------------------|--------------|------------------------|
|Ordinal by Ordinal|Gamma|.095 |.420                      |.226          |.821                    |
|------------------------|-----|--------------------------|--------------|------------------------|
|N of Valid Cases        |16   |                          |              |                        |
|-------------------------------------------------------------------------------------------------|
 a Participant Private ID = 11450328.000000000
 b Not assuming the null hypothesis.
 c Using the asymptotic standard error assuming the null hypothesis.
All 14 variables are: 
CC.FOR_Mean * CC.T1TimedAcc_Mean
CC.FOR_Mean * CC.T2UntimedAcc_Mean
CC.FOR_Mean * CC.LogRT.T2Untimed_Mean
CC.FOR_Mean * CC.AnsChange_Mean
NC.FOR_Mean * NC.T1TimedAcc_Mean
NC.FOR_Mean * NC.T2UntimedAcc_Mean
NC.FOR_Mean * NC.LogRT.T2Untimed_Mean
NC.FOR_Mean * NC.AnsChange_Mean
CW.FOR_Mean * CW.T1TimedAcc_Mean
CW.FOR_Mean * CW.T2UntimedAcc_Mean
CW.FOR_Mean * CW.LogRT.T2Untimed_Mean
CW.FOR_Mean * CW.AnsChange_Mean
FORAll_Mean * T1AccAll_Mean
FORAll_Mean * T2AccAll_Mean

And I need the Gamma value extracted for each participant in each variable. So for participant 11450328 in the CC.FOR_Mean CC.T2UntimedAcc_Mean variable, that value is .095.

There are also negative values which would need to be recorded.

For the first two variables and first two participants, this is what I want the resulting .csv file to look like:

ParticipantPrivateID, CC.FORxCC.T1TimedAcc_Gamma, CC.FORxCC.T2UntimedAcc_Gamma,
11450318, .217, .042,
11450328, .111, .095,

I also want the variable names for the rest of the variables to be similarly structured.

Again, I've been using ChatGPT for this, so I have no idea if I'm on the right track. This is the code that has only come a bit close to doing what I need:

# Load required libraries
library(dplyr)
library(stringr)

# Define the file path
file_path <- "myfilepath"

# Read the file
lines <- readLines(file_path)

# Identify lines of interest
participant_lines <- grep("Participant Private ID =", lines)
symmetric_measures_lines <- grep("Symmetric Measuresa", lines)

# Initialize a list to store data for each variable
data_list <- list()

# Define variables
variables <- c(
  "CC.FORxT1TimedAcc_Gamma",
  "CC.FORxT2UntimedAcc_Gamma",
  "CC.FORxLogRT.T2Untimed_Gamma",
  "CC.FORxAnsChange_Gamma",
  "NC.FORxT1TimedAcc_Gamma",
  "NC.FORxT2UntimedAcc_Gamma",
  "NC.FORxLogRT.T2Untimed_Gamma",
  "NC.FORxAnsChange_Gamma",
  "CW.FORxT1TimedAcc_Gamma",
  "CW.FORxT2UntimedAcc_Gamma",
  "CW.FORxLogRT.T2Untimed_Gamma",
  "CW.FORxAnsChange_Gamma",
  "FORAllxT1AccAll_Gamma",
  "FORAllxT2AccAll_Gamma"
)

# Initialize a dataframe to store the results
final_results <- data.frame(ParticipantPrivateID = character(), stringsAsFactors = FALSE)

# Function to extract data
extract_data <- function(variable_name, index, lines) {
  participant_ids <- c()
  gamma_values <- c()
  
  for (i in seq_along(participant_lines)) {
    # Extract participant ID
    participant_id <- str_extract(lines[participant_lines[i]], "(?<=Participant Private ID = )[0-9]+")
    
    # Initialize value for this participant
    gamma_value <- NA
    
    # Find corresponding Symmetric Measures line
    if (index <= length(symmetric_measures_lines)) {
      # The data should be in the line following "Symmetric Measuresa"
      data_line_index <- symmetric_measures_lines[index] + 4 # Adjusted index for correct line
      
      if (data_line_index <= length(lines)) {
        # Extract the line
        line_data <- lines[data_line_index]
        
        # Extract numeric values, including those with '<'
        numeric_values <- str_extract_all(line_data, "\\d*\\.\\d+|\\d+\\.\\d*|\\d+|<\\s*\\d*\\.\\d+|<\\s*\\d+")[[1]]
        
        # Convert extracted values to numeric, handling special cases
        converted_values <- sapply(numeric_values, function(x) {
          if (grepl("^<", x)) {
            return(NA)  # Handle cases with '<' by returning NA
          } else {
            return(as.numeric(x))
          }
        })
        
        # Check if we have the expected number of values
        if (length(converted_values) >= 4) {
          gamma_value <- converted_values[1]
        }
      }
    }
    
    # Store extracted values
    participant_ids <- c(participant_ids, participant_id)
    gamma_values <- c(gamma_values, gamma_value)
  }
  
  # Create a dataframe for the variable
  data_df <- data.frame(
    ParticipantPrivateID = participant_ids,
    GammaValue = gamma_values,
    stringsAsFactors = FALSE
  ) %>%
    unique()  # Ensure unique rows by ParticipantPrivateID
  
  # Return the dataframe
  return(data_df)
}

# Loop through each variable and extract data
for (i in seq_along(variables)) {
  var_name <- variables[i]
  index <- i
  data_df <- extract_data(var_name, index, lines)
  
  # Add variable column to the final results
  final_results <- full_join(final_results, data_df, by = "ParticipantPrivateID")
  colnames(final_results)[ncol(final_results)] <- var_name
}

# Display the final dataset
print(final_results)

# Save the final dataset to a CSV file
write.csv(final_results, "final_dataset.csv", row.names = FALSE)

Problem: This code is recording all of the correct variables that I need, but it is not correctly recording the values for each variable and participant. It is repeating the values for the first participant for each participant. Each participant's own unique data needs to be extracted without any modifications.


Solution

  • Assume file_path and variables from the question and assume that the file contains just the lines shown in the question or if the file has all data uncomment the commented out v <- variables line. The components of variables appear to be in the same order as they appear in the input and we assume that that is the case.

    v is defined to be the variables that appear in the input and is 2 for the example in the question but would be all of variables (see commented out line) for the real data.

    Read the lines into lines, extract the lines that match the indicated regular expression and then from those lines extract all numbers using parse_number. Reform that into a 2 column matrix and then data frame.

    If this data frame has n rows and if there are length(v) variables (2 in the question) then there are n/length(v) participants and the first n/length(v) rows will contain the values of the first variable for each participant, the second n/length(v) rows will contain the values of the second variable for each participant and so on. Knowing that add a column for the variable names.

    Convert that from long form to wide form, fix the first column name and write it out.

    Assuming that the full input data follows the same pattern the only changes needed would be to uncomment the v <- variables line and add a filename to the write.csv line.

    library(dplyr)
    library(readr)
    library(tidyr)
    
    lines <- readLines(file_path)
    
    v <- variables[1:2]  # variables present - question has first 2
    # v <- variables     # uncomment for all variables
    
    lines %>%
      grep("^P.*ID|Gamma", ., value = TRUE) %>%
      parse_number() %>%
      matrix(ncol = 2, byrow = TRUE) %>%
      as.data.frame %>%
      mutate(name = rep(v, each = n() / length(v))) %>%
      pivot_wider(id_cols = "V1", names_from = "name", values_from = "V2") %>%
      rename(ParticipantPrivateID = "V1") %>%
      write.csv(quote = FALSE, row.names = FALSE)
    

    giving

    ParticipantPrivateID,CC.FORxT1TimedAcc_Gamma,CC.FORxT2UntimedAcc_Gamma
    11450318,0.217,0.042
    11450328,0.111,0.095
    

    Note

    Note that for purposes of testing we could use this instead of lines shown in the question. The code shown above works for this as well as the lines shown in the question giving the same result.

    lines <- c(
      "Participant Private ID = 11450318.000000000",
      "|Ordinal by Ordinal|Gamma|.217 |.334                      |.644          |.520                    |",
      "Participant Private ID = 11450328.000000000",
      "|Ordinal by Ordinal|Gamma|.111 |.419                      |.264          |.792                    |",
      "Participant Private ID = 11450318.000000000",
      "|Ordinal by Ordinal|Gamma|.042 |.337                      |.124          |.901                    |",
      "Participant Private ID = 11450328.000000000",
      "|Ordinal by Ordinal|Gamma|.095 |.420                      |.226          |.821                    |"
    )