Package 'bulkreadr' reference manual

Title:	The Ultimate Tool for Reading Data in Bulk
Description:	Designed to simplify and streamline the process of reading and processing large volumes of data in R, this package offers a collection of functions tailored for bulk data operations. It enables users to efficiently read multiple sheets from Microsoft Excel and Google Sheets workbooks, as well as various CSV files from a directory. The data is returned as organized data frames, facilitating further analysis and manipulation. Ideal for handling extensive data sets or batch processing tasks, bulkreadr empowers users to manage data in bulk effortlessly, saving time and effort in data preparation workflows. Additionally, the package seamlessly works with labelled data from SPSS and Stata.
Authors:	Ezekiel Ogundepo [aut, cre] , Ernest Fokoue [ctb] , Golibe Ezeechesi [ctb], Fatimo Adebanjo [ctb], Isaac Ajao [ctb]
Maintainer:	Ezekiel Ogundepo <[email protected]>
License:	MIT + file LICENSE
Version:	1.1.1
Built:	2025-03-09 05:57:57 UTC
Source:	https://github.com/gbganalyst/bulkreadr

User friendly date parsing function

Description

convert_to_date() parses an input vector into POSIXct date object. It is also powerful to convert from excel date number like 42370 into date value like 2016-01-01.

Usage

convert_to_date(date_num_char, tz = "UTC")
convert_to_date(date_num_char, tz = "UTC")

Arguments

`date_num_char`	A character or numeric vector of dates
`tz`	Time zone indicator. If `NULL` (default), a Date object is returned. Otherwise a POSIXct with time zone attribute set to `tz`.

Value

a vector of class Date

Examples


## ** heterogeneous dates **

dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01-  2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
)

convert_to_date(dates)

## ** heterogeneous dates **

dates <- c(
  44869, "22.09.2022", NA, "02/27/92", "01-19-2022",
  "13-01-  2022", "2023", "2023-2", 41750.2, 41751.99,
  "11 07 2023", "2023-4"
)

convert_to_date(dates)

Fill missing values in a data frame

Description

fill_missing_values() is an efficient function that addresses missing values in a data frame. It uses imputation by function, also known as column-based imputation, to impute the missing values. For continuous variables, it supports various methods of imputation, including minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. This approach ensures accurate and consistent replacements derived from individual columns, resulting in a complete and reliable dataset for improved analysis and decision-making.

Usage

fill_missing_values(
  df,
  selected_variables = NULL,
  method = c("mean", "min", "max", "median", "harmonic", "geometric")
)
fill_missing_values(
  df,
  selected_variables = NULL,
  method = c("mean", "min", "max", "median", "harmonic", "geometric")
)

Arguments

`df`	A dataframe to process for missing value imputation.
`selected_variables`	An optional vector of variable names within `df` for which missing values should be imputed. If `NULL` (default), imputation is applied to all variables in the data frame. Variables must be quoted.
`method`	A character string specifying the imputation method for continuous variables. Supported methods are `"min"`, `"max"`, `"mean"`, `"median"`, `"harmonic"`, and `"geometric"`. The default method is `"mean"`. For categorical variables, the `mode` is always used.

Value

A data frame with missing values imputed according to the specified method.

Examples


library(dplyr)

# Assuming 'df' is the dataframe you want to process

df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
           NA, "virginica", "setosa")
)

# Impute using the mean method for continuous variables

result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean

# Impute using the geometric mean for continuous variables and specify
# variables `Petal_Length` and `Petal_Width`.

result_df_geomean <- fill_missing_values(df, selected_variables = c
("Petal_Length", "Petal_Width"), method = "geometric")

result_df_geomean

# Impute missing values (NAs) in a grouped data frame

# You can do that by using the following:

sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
           "virginica", "virginica", "setosa")
)

sample_iris %>%
group_by(Species) %>%
group_split() %>%
map_df(fill_missing_values, method = "median")


library(dplyr)

# Assuming 'df' is the dataframe you want to process

df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
           NA, "virginica", "setosa")
)

# Impute using the mean method for continuous variables

result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean

# Impute using the geometric mean for continuous variables and specify
# variables `Petal_Length` and `Petal_Width`.

result_df_geomean <- fill_missing_values(df, selected_variables = c
("Petal_Length", "Petal_Width"), method = "geometric")

result_df_geomean

# Impute missing values (NAs) in a grouped data frame

# You can do that by using the following:

sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
           "virginica", "virginica", "setosa")
)

sample_iris %>%
group_by(Species) %>%
group_split() %>%
map_df(fill_missing_values, method = "median")

Create a data dictionary from labelled data

Description

generate_dictionary() creates a data dictionary from a specified data frame. This function is particularly useful for understanding and documenting the structure of your dataset, similar to data dictionaries in Stata or SPSS.

Usage

generate_dictionary(data)
generate_dictionary(data)

Arguments

data

a data frame or a survey object

Details

The function returns a tibble (a modern version of R's data frame) with the following columns:

position: An integer vector indicating the column position in the data frame.
variable: A character vector containing the names of the variables (columns).
description: A character vector with a human-readable description of each variable.
column type: A character vector specifying the data type (e.g., numeric, character) of each variable.
missing: An integer vector indicating the count of missing values for each variable.
levels: A list vector containing the levels for categorical variables, if applicable.

Value

A tibble representing the data dictionary. Each row corresponds to a variable in the original data frame, providing detailed information about the variable's characteristics.

Examples


# Creating a data dictionary from an SPSS file

file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")

wage_data <- read_spss_data(file = file_path)

generate_dictionary(wage_data)


# Creating a data dictionary from an SPSS file

file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")

wage_data <- read_spss_data(file = file_path)

generate_dictionary(wage_data)

Summarize missingness in data frame columns

Description

inspect_na() summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.

Usage

inspect_na(df)
inspect_na(df)

Arguments

`df`	A data frame

Details

The tibble returned contains the columns:

col_name, a character vector containing column names of df1.
cnt, an integer vector containing the number of missing values by column.
pcnt, the percentage of records in each columns that is missing.

Value

A tibble summarizing the count and percentage of columnwise missingness for a data frame.

Examples


library(dplyr)

# dataframe summary

inspect_na(airquality)

# grouped dataframe summary

airquality %>%
group_by(Month) %>%
 inspect_na()

library(dplyr)

# dataframe summary

inspect_na(airquality)

# grouped dataframe summary

airquality %>%
group_by(Month) %>%
 inspect_na()

Look for keywords variable names and descriptions in labelled data

Description

The look_for() function is designed to emulate the functionality of the Stata lookfor command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest.

Usage

look_for(
  data,
  ...,
  labels = TRUE,
  values = TRUE,
  ignore.case = TRUE,
  details = c("basic", "none", "full")
)
look_for(
  data,
  ...,
  labels = TRUE,
  values = TRUE,
  ignore.case = TRUE,
  details = c("basic", "none", "full")
)

Arguments

`data`	a data frame or a survey object
`...`	optional list of keywords, a character string (or several character strings), which can be formatted as a regular expression suitable for a `base::grep()` pattern, or a vector of keywords; displays all variables if not specified
`labels`	whether or not to search variable labels (descriptions); `TRUE` by default
`values`	whether or not to search within values (factor levels or value labels); `TRUE` by default
`ignore.case`	whether or not to make the keywords case sensitive; `TRUE` by default (case is ignored during matching)
`details`	add details about each variable (full details could be time consuming for big data frames, `FALSE` is equivalent to `"none"` and `TRUE` to `"full"`)

Value

A tibble data frame featuring the variable position, name and description (if it exists) in the original data frame.

Examples


look_for(iris)

# Look for a single keyword.

look_for(iris, "petal")

look_for(iris, "s")

look_for(iris)

# Look for a single keyword.

look_for(iris, "petal")

look_for(iris, "s")

Extract or replace parts of an object

Description

pull_out() is similar to [. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (⁠%>%⁠) and base (⁠|>⁠) operators.

Value

pull_out() will return an object of the same class as the input object.

Examples


good_choice <- letters %>%
  pull_out(c(5, 2, 1, 4))

good_choice

iris %>%
  pull_out(, 1:4) %>%
  head()

good_choice <- letters %>%
  pull_out(c(5, 2, 1, 4))

good_choice

iris %>%
  pull_out(, 1:4) %>%
  head()

Reads all CSV files from a directory

Description

read_csv_files_from_dir reads all csv files from the "~/data" directory and returns an appended dataframe. The resulting dataframe will be in the same order as the CSV files in the directory.

Usage

read_csv_files_from_dir(dir_path = ".", col_types = NULL, .id = NULL)
read_csv_files_from_dir(dir_path = ".", col_types = NULL, .id = NULL)

Arguments

dir_path

Path to the directory containing the CSV files.

col_types

One of NULL, a cols() specification, or a string. See vignette("readr") for more details.

If NULL, all column types will be inferred from guess_max rows of the input, interspersed throughout the file. This is convenient (and fast), but not robust. If the guessed types are wrong, you'll need to increase guess_max or supply the correct types yourself.

Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().

Alternatively, you can use a compact string representation where each character represents one column:

c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip

By default, reading a file without a column specification will print a message showing what readr guessed they were. To remove this message, set show_col_types = FALSE or set options(readr.show_col_types = FALSE).

.id

The name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date. If NULL (the default) no extra column is created.

Value

A tibble. If there is any column type mismatch during data frames row binding, an error will occur. This is because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.

Examples


directory <- system.file("csvfolder",  package = "bulkreadr")

read_csv_files_from_dir(dir_path = directory, .id = "cut")

# Column types mismatch error --------------------------------------
# If the `read_csv_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.


directory <- system.file("csvfolder",  package = "bulkreadr")

read_csv_files_from_dir(dir_path = directory, .id = "cut")

# Column types mismatch error --------------------------------------
# If the `read_csv_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.

Read Excel Workbooks data from a directory

Description

read_excel_files_from_dir() reads all Excel workbooks in the "~/data" directory and returns an appended dataframe.

Usage

read_excel_files_from_dir(dir_path, col_types = NULL, .id = NULL)
read_excel_files_from_dir(dir_path, col_types = NULL, .id = NULL)

Arguments

`dir_path`	Path to the directory containing the `xls`/`xlsx` files.
`col_types`	Either `NULL` to guess all from the spreadsheet or a character vector containing one entry per column from these options: "skip", "guess", "logical", "numeric", "date", "text" or "list". If exactly one `col_type` is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from `col_types = NULL`, but on a cell-by-cell basis.
`.id`	The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.

Value

Examples


directory <- system.file("xlsxfolder",  package = "bulkreadr")

read_excel_files_from_dir(dir_path = directory, .id = "cut")

# Column types mismatch error --------------------------------------
# If the `read_excel_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting dataframe be characters.


directory <- system.file("xlsxfolder",  package = "bulkreadr")

read_excel_files_from_dir(dir_path = directory, .id = "cut")

# Column types mismatch error --------------------------------------
# If the `read_excel_files_from_dir()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting dataframe be characters.

Import data from multiple sheets of an Excel workbook

Description

read_excel_workbook() reads all the data from the sheets of an Excel workbook and return an appended dataframe.

Usage

read_excel_workbook(path, col_types = NULL, .id = NULL)
read_excel_workbook(path, col_types = NULL, .id = NULL)

Arguments

`path`	Path to the xls/xlsx file.
`col_types`	Either `NULL` to guess all from the spreadsheet or a character vector containing one entry per column from these options: "skip", "guess", "logical", "numeric", "date", "text" or "list". If exactly one `col_type` is specified, it will be recycled. The content of a cell in a skipped column is never read and that column will not appear in the data frame output. A list cell loads a column as a list of length 1 vectors, which are typed using the type guessing logic from `col_types = NULL`, but on a cell-by-cell basis.
`.id`	The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.

Value

Examples


path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE)

read_excel_workbook(path = path, .id = "Year")


# Column types mismatch error --------------------------------------
# If the `read_excel_workbook()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting DataFrame be characters.

path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE)

read_excel_workbook(path = path, .id = "Year")


# Column types mismatch error --------------------------------------
# If the `read_excel_workbook()` function complains about a data type mismatch,
# then set the `col_types` argument to `"text"`.
# This will make all the column types in the resulting DataFrame be characters.

Import data from multiple sheets in Google Sheets

Description

The read_gsheets() function imports data from multiple sheets in a Google Sheets spreadsheet and appends the resulting dataframes from each sheet together to create a single dataframe. This function is a powerful tool for data analysis, as it allows you to easily combine data from multiple sheets into a single dataset.

Usage

read_gsheets(ss, col_types = NULL, .id = NULL)
read_gsheets(ss, col_types = NULL, .id = NULL)

Arguments

ss

Something that identifies a Google Sheet:

its file id as a string or drive_id
a URL from which we can recover the id
a one-row dribble, which is how googledrive represents Drive files
an instance of googlesheets4_spreadsheet, which is what gs4_get() returns

Processed through as_sheets_id().

col_types

Column types. Either NULL to guess all from the spreadsheet or a string of readr-style shortcodes, with one character or code per column. If exactly one col_type is specified, it is recycled. See Column Specification for more.

.id

The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions.

Value

Examples



sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"

read_gsheets(ss = sheet_id, .id = "sheet.name")

# Column types mismatch error --------------------------------------
# If the `read_gsheets()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.

# For example,


sheet_id <- "1rrjKAV05POre9lDVtHtZePTa8VROf1onVO47cHnhrTU"

try(read_gsheets(ss = sheet_id)) # error, column types mismatch

read_gsheets(ss = sheet_id, col_types = "c")


sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0"

read_gsheets(ss = sheet_id, .id = "sheet.name")

# Column types mismatch error --------------------------------------
# If the `read_gsheets()` function complains about a data type mismatch,
# then set the `col_types` argument to `"c"`.
# This will make all the column types in the resulting dataframe be characters.

# For example,


sheet_id <- "1rrjKAV05POre9lDVtHtZePTa8VROf1onVO47cHnhrTU"

try(read_gsheets(ss = sheet_id)) # error, column types mismatch

read_gsheets(ss = sheet_id, col_types = "c")

Read SPSS data file

Description

read_spss_data() is designed to seamlessly import data from an SPSS data (.sav or .zsav) files. It converts labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment.

Usage

read_spss_data(file, label = FALSE)
read_spss_data(file, label = FALSE)

Arguments

`file`	The path to the SPSS data file.
`label`	Logical indicating whether to use variable labels as column names (default is FALSE).

Value

A tibble containing the data from the SPSS file.

Examples

# Read an SPSS data file without converting variable labels as column names

file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")

data <- read_spss_data(file = file_path)

data

# Read an SPSS data file and convert variable labels as column names

data <- read_spss_data(file = file_path, label = TRUE)

data

# Read an SPSS data file without converting variable labels as column names

file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr")

data <- read_spss_data(file = file_path)

data

# Read an SPSS data file and convert variable labels as column names

data <- read_spss_data(file = file_path, label = TRUE)

data

Read Stata data file

Description

Read Stata data file

Usage

read_stata_data(file, label = FALSE)
read_stata_data(file, label = FALSE)

Arguments

`file`	The path to the Stata data file.
`label`	Logical indicating whether to use variable labels as column names (default is FALSE).

Value

A data frame containing the Stata data, with labeled variables converted to factors.

Examples

# Read Stata data file without converting variable labels as column names

file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")

data <- read_stata_data(file = file_path)

data

# Read Stata data file and convert variable labels as column names

data <- read_stata_data(file = file_path, label = TRUE)

data

# Read Stata data file without converting variable labels as column names

file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr")

data <- read_stata_data(file = file_path)

data

# Read Stata data file and convert variable labels as column names

data <- read_stata_data(file = file_path, label = TRUE)

data

Package 'bulkreadr'

Help Index

User friendly date parsing function

Description

Usage

Arguments

Value

Examples

Fill missing values in a data frame

Description

Usage

Arguments

Value

Examples

Create a data dictionary from labelled data

Description

Usage

Arguments

Details

Value

Examples

Summarize missingness in data frame columns

Description

Usage

Arguments

Details

Value

Examples

Look for keywords variable names and descriptions in labelled data

Description

Usage

Arguments

Value

Examples

Extract or replace parts of an object

Description

Value

Examples

Reads all CSV files from a directory

Description

Usage

Arguments

Value

See Also

Examples

Read Excel Workbooks data from a directory

Description

Usage

Arguments

Value

See Also

Examples

Import data from multiple sheets of an Excel workbook

Description

Usage

Arguments

Value

See Also

Examples

Import data from multiple sheets in Google Sheets

Description

Usage

Arguments

Value

See Also

Examples

Read SPSS data file

Description

Usage

Arguments

Value

See Also

Examples

Read Stata data file

Description

Usage

Arguments

Value

See Also

Examples