Title: | The Ultimate Tool for Reading Data in Bulk |
---|---|
Description: | Designed to simplify and streamline the process of reading and processing large volumes of data in R, this package offers a collection of functions tailored for bulk data operations. It enables users to efficiently read multiple sheets from Microsoft Excel and Google Sheets workbooks, as well as various CSV files from a directory. The data is returned as organized data frames, facilitating further analysis and manipulation. Ideal for handling extensive data sets or batch processing tasks, bulkreadr empowers users to manage data in bulk effortlessly, saving time and effort in data preparation workflows. Additionally, the package seamlessly works with labelled data from SPSS and Stata. |
Authors: | Ezekiel Ogundepo [aut, cre] , Ernest Fokoue [ctb] , Golibe Ezeechesi [ctb], Fatimo Adebanjo [ctb], Isaac Ajao [ctb] |
Maintainer: | Ezekiel Ogundepo <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.1 |
Built: | 2024-11-09 05:24:09 UTC |
Source: | https://github.com/gbganalyst/bulkreadr |
convert_to_date()
parses an input vector into POSIXct date object. It is also powerful to convert from excel date number like 42370
into date value like 2016-01-01
.
convert_to_date(date_num_char, tz = "UTC")
convert_to_date(date_num_char, tz = "UTC")
date_num_char |
A character or numeric vector of dates |
tz |
Time zone indicator. If |
a vector of class Date
## ** heterogeneous dates ** dates <- c( 44869, "22.09.2022", NA, "02/27/92", "01-19-2022", "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99, "11 07 2023", "2023-4" ) convert_to_date(dates)
## ** heterogeneous dates ** dates <- c( 44869, "22.09.2022", NA, "02/27/92", "01-19-2022", "13-01- 2022", "2023", "2023-2", 41750.2, 41751.99, "11 07 2023", "2023-4" ) convert_to_date(dates)
fill_missing_values()
is an efficient function that addresses missing
values in a data frame. It uses imputation by function, also known as
column-based imputation, to impute the missing values. For continuous
variables, it supports various methods of imputation, including minimum,
maximum, mean, median, harmonic mean, and geometric mean. For categorical
variables, missing values are replaced with the mode of the column. This
approach ensures accurate and consistent replacements derived from individual
columns, resulting in a complete and reliable dataset for improved analysis
and decision-making.
fill_missing_values( df, selected_variables = NULL, method = c("mean", "min", "max", "median", "harmonic", "geometric") )
fill_missing_values( df, selected_variables = NULL, method = c("mean", "min", "max", "median", "harmonic", "geometric") )
df |
A dataframe to process for missing value imputation. |
selected_variables |
An optional vector of variable names within |
method |
A character string specifying the imputation method for
continuous variables. Supported methods are |
A data frame with missing values imputed according to the specified
method
.
library(dplyr) # Assuming 'df' is the dataframe you want to process df <- tibble::tibble( Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA), Species = c("setosa", NA, "versicolor", "setosa", NA, "virginica", "setosa") ) # Impute using the mean method for continuous variables result_df_mean <- fill_missing_values(df, method = "mean") result_df_mean # Impute using the geometric mean for continuous variables and specify # variables `Petal_Length` and `Petal_Width`. result_df_geomean <- fill_missing_values(df, selected_variables = c ("Petal_Length", "Petal_Width"), method = "geometric") result_df_geomean # Impute missing values (NAs) in a grouped data frame # You can do that by using the following: sample_iris <- tibble::tibble( Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA), Species = c("setosa", "setosa", "versicolor", "setosa", "virginica", "virginica", "setosa") ) sample_iris %>% group_by(Species) %>% group_split() %>% map_df(fill_missing_values, method = "median")
library(dplyr) # Assuming 'df' is the dataframe you want to process df <- tibble::tibble( Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA), Species = c("setosa", NA, "versicolor", "setosa", NA, "virginica", "setosa") ) # Impute using the mean method for continuous variables result_df_mean <- fill_missing_values(df, method = "mean") result_df_mean # Impute using the geometric mean for continuous variables and specify # variables `Petal_Length` and `Petal_Width`. result_df_geomean <- fill_missing_values(df, selected_variables = c ("Petal_Length", "Petal_Width"), method = "geometric") result_df_geomean # Impute missing values (NAs) in a grouped data frame # You can do that by using the following: sample_iris <- tibble::tibble( Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5), Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7), Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA), Species = c("setosa", "setosa", "versicolor", "setosa", "virginica", "virginica", "setosa") ) sample_iris %>% group_by(Species) %>% group_split() %>% map_df(fill_missing_values, method = "median")
generate_dictionary()
creates a data dictionary from a specified data frame.
This function is particularly useful for understanding and documenting the
structure of your dataset, similar to data dictionaries in Stata or SPSS.
generate_dictionary(data)
generate_dictionary(data)
data |
a data frame or a survey object |
The function returns a tibble (a modern version of R's data frame) with the following columns:
position: An integer vector indicating the column position in the data frame.
variable: A character vector containing the names of the variables (columns).
description: A character vector with a human-readable description of each variable.
column type: A character vector specifying the data type (e.g., numeric, character) of each variable.
missing: An integer vector indicating the count of missing values for each variable.
levels: A list vector containing the levels for categorical variables, if applicable.
A tibble representing the data dictionary. Each row corresponds to a variable in the original data frame, providing detailed information about the variable's characteristics.
# Creating a data dictionary from an SPSS file file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") wage_data <- read_spss_data(file = file_path) generate_dictionary(wage_data)
# Creating a data dictionary from an SPSS file file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") wage_data <- read_spss_data(file = file_path) generate_dictionary(wage_data)
inspect_na()
summarizes the rate of missingness in each column of a data frame. For a grouped data frame, the rate of missingness is summarized separately for each group.
inspect_na(df)
inspect_na(df)
df |
A data frame |
The tibble returned contains the columns:
col_name, a character vector containing column names of df1.
cnt, an integer vector containing the number of missing values by column.
pcnt, the percentage of records in each columns that is missing.
A tibble summarizing the count and percentage of columnwise missingness for a data frame.
library(dplyr) # dataframe summary inspect_na(airquality) # grouped dataframe summary airquality %>% group_by(Month) %>% inspect_na()
library(dplyr) # dataframe summary inspect_na(airquality) # grouped dataframe summary airquality %>% group_by(Month) %>% inspect_na()
The look_for()
function is designed to emulate the functionality of the Stata lookfor
command in R. It provides a powerful tool for searching through large datasets, specifically targeting variable names, variable label descriptions, factor levels, and value labels. This function is handy for users working with extensive and complex datasets, enabling them to quickly and efficiently locate the variables of interest.
look_for( data, ..., labels = TRUE, values = TRUE, ignore.case = TRUE, details = c("basic", "none", "full") )
look_for( data, ..., labels = TRUE, values = TRUE, ignore.case = TRUE, details = c("basic", "none", "full") )
data |
a data frame or a survey object |
... |
optional list of keywords, a character string (or several
character strings), which can be formatted as a regular expression suitable
for a |
labels |
whether or not to search variable labels (descriptions);
|
values |
whether or not to search within values (factor levels or value
labels); |
ignore.case |
whether or not to make the keywords case sensitive;
|
details |
add details about each variable (full details could be time
consuming for big data frames, |
A tibble data frame featuring the variable position, name and description (if it exists) in the original data frame.
look_for(iris) # Look for a single keyword. look_for(iris, "petal") look_for(iris, "s")
look_for(iris) # Look for a single keyword. look_for(iris, "petal") look_for(iris, "s")
pull_out()
is similar to [
. It acts on vectors, matrices, arrays and lists to extract or replace parts. It is pleasant to use with the magrittr (%>%
) and base (|>
) operators.
pull_out()
will return an object of the same class as the input object.
good_choice <- letters %>% pull_out(c(5, 2, 1, 4)) good_choice iris %>% pull_out(, 1:4) %>% head()
good_choice <- letters %>% pull_out(c(5, 2, 1, 4)) good_choice iris %>% pull_out(, 1:4) %>% head()
read_csv_files_from_dir
reads all csv files from the "~/data"
directory and returns an appended dataframe. The resulting dataframe will be in the same order as the CSV files in the directory.
read_csv_files_from_dir(dir_path = ".", col_types = NULL, .id = NULL)
read_csv_files_from_dir(dir_path = ".", col_types = NULL, .id = NULL)
dir_path |
Path to the directory containing the CSV files. |
col_types |
One of If Column specifications created by Alternatively, you can use a compact string representation where each character represents one column:
By default, reading a file without a column specification will print a
message showing what |
.id |
The name of a column in which to store the file path. This is
useful when reading multiple input files and there is data in the file
paths, such as the data collection date. If |
A tibble. If there is any column type mismatch during data frames row binding, an error will occur. This is because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
read_excel_files_from_dir()
which reads Excel workbooks data from a directory.
directory <- system.file("csvfolder", package = "bulkreadr") read_csv_files_from_dir(dir_path = directory, .id = "cut") # Column types mismatch error -------------------------------------- # If the `read_csv_files_from_dir()` function complains about a data type mismatch, # then set the `col_types` argument to `"c"`. # This will make all the column types in the resulting dataframe be characters.
directory <- system.file("csvfolder", package = "bulkreadr") read_csv_files_from_dir(dir_path = directory, .id = "cut") # Column types mismatch error -------------------------------------- # If the `read_csv_files_from_dir()` function complains about a data type mismatch, # then set the `col_types` argument to `"c"`. # This will make all the column types in the resulting dataframe be characters.
read_excel_files_from_dir()
reads all Excel workbooks in the "~/data"
directory and returns an appended dataframe.
read_excel_files_from_dir(dir_path, col_types = NULL, .id = NULL)
read_excel_files_from_dir(dir_path, col_types = NULL, .id = NULL)
dir_path |
Path to the directory containing the |
col_types |
Either |
.id |
The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions. |
A tibble. If there is any column type mismatch during data frames row binding, an error will occur. This is because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
read_excel_workbook()
which imports data from multiple sheets of an Excel workbook
directory <- system.file("xlsxfolder", package = "bulkreadr") read_excel_files_from_dir(dir_path = directory, .id = "cut") # Column types mismatch error -------------------------------------- # If the `read_excel_files_from_dir()` function complains about a data type mismatch, # then set the `col_types` argument to `"text"`. # This will make all the column types in the resulting dataframe be characters.
directory <- system.file("xlsxfolder", package = "bulkreadr") read_excel_files_from_dir(dir_path = directory, .id = "cut") # Column types mismatch error -------------------------------------- # If the `read_excel_files_from_dir()` function complains about a data type mismatch, # then set the `col_types` argument to `"text"`. # This will make all the column types in the resulting dataframe be characters.
read_excel_workbook()
reads all the data from the sheets of an Excel workbook and return an appended dataframe.
read_excel_workbook(path, col_types = NULL, .id = NULL)
read_excel_workbook(path, col_types = NULL, .id = NULL)
path |
Path to the xls/xlsx file. |
col_types |
Either |
.id |
The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions. |
A tibble. If there is any column type mismatch during data frames row binding, an error will occur. This is because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
read_excel()
, which reads a Sheet of an Excel file into a data frame, and read_gsheets()
, which imports data from multiple sheets in a Google Sheets.
path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE) read_excel_workbook(path = path, .id = "Year") # Column types mismatch error -------------------------------------- # If the `read_excel_workbook()` function complains about a data type mismatch, # then set the `col_types` argument to `"text"`. # This will make all the column types in the resulting DataFrame be characters.
path <- system.file("extdata", "Diamonds.xlsx", package = "bulkreadr", mustWork = TRUE) read_excel_workbook(path = path, .id = "Year") # Column types mismatch error -------------------------------------- # If the `read_excel_workbook()` function complains about a data type mismatch, # then set the `col_types` argument to `"text"`. # This will make all the column types in the resulting DataFrame be characters.
The read_gsheets()
function imports data from multiple sheets in a Google Sheets spreadsheet and appends the resulting dataframes from each sheet together to create a single dataframe. This function is a powerful tool for data analysis, as it allows you to easily combine data from multiple sheets into a single dataset.
read_gsheets(ss, col_types = NULL, .id = NULL)
read_gsheets(ss, col_types = NULL, .id = NULL)
ss |
Something that identifies a Google Sheet:
Processed through |
col_types |
Column types. Either |
.id |
The name of an optional identifier column. Provide a string to create an output column that identifies each input. The column will use names if available, otherwise it will use positions. |
A tibble. If there is any column type mismatch during data frames row binding, an error will occur. This is because R cannot combine columns of different types. For example, you cannot combine a column of integers with a column of characters.
read_sheet()
which reads a Google (spread)Sheet into a data frame.
sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0" read_gsheets(ss = sheet_id, .id = "sheet.name") # Column types mismatch error -------------------------------------- # If the `read_gsheets()` function complains about a data type mismatch, # then set the `col_types` argument to `"c"`. # This will make all the column types in the resulting dataframe be characters. # For example, sheet_id <- "1rrjKAV05POre9lDVtHtZePTa8VROf1onVO47cHnhrTU" try(read_gsheets(ss = sheet_id)) # error, column types mismatch read_gsheets(ss = sheet_id, col_types = "c")
sheet_id <- "1izO0mHu3L9AMySQUXGDn9GPs1n-VwGFSEoAKGhqVQh0" read_gsheets(ss = sheet_id, .id = "sheet.name") # Column types mismatch error -------------------------------------- # If the `read_gsheets()` function complains about a data type mismatch, # then set the `col_types` argument to `"c"`. # This will make all the column types in the resulting dataframe be characters. # For example, sheet_id <- "1rrjKAV05POre9lDVtHtZePTa8VROf1onVO47cHnhrTU" try(read_gsheets(ss = sheet_id)) # error, column types mismatch read_gsheets(ss = sheet_id, col_types = "c")
read_spss_data()
is designed to seamlessly import data from an SPSS data (.sav
or .zsav
) files. It converts labelled variables into factors, a crucial step that enhances the ease of data manipulation and analysis within the R programming environment.
read_spss_data(file, label = FALSE)
read_spss_data(file, label = FALSE)
file |
The path to the SPSS data file. |
label |
Logical indicating whether to use variable labels as column names (default is FALSE). |
A tibble containing the data from the SPSS file.
read_stata_data()
which reads Stata data file and converts labelled variables into factors.
# Read an SPSS data file without converting variable labels as column names file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") data <- read_spss_data(file = file_path) data # Read an SPSS data file and convert variable labels as column names data <- read_spss_data(file = file_path, label = TRUE) data
# Read an SPSS data file without converting variable labels as column names file_path <- system.file("extdata", "Wages.sav", package = "bulkreadr") data <- read_spss_data(file = file_path) data # Read an SPSS data file and convert variable labels as column names data <- read_spss_data(file = file_path, label = TRUE) data
Read Stata data file
read_stata_data(file, label = FALSE)
read_stata_data(file, label = FALSE)
file |
The path to the Stata data file. |
label |
Logical indicating whether to use variable labels as column names (default is FALSE). |
A data frame containing the Stata data, with labeled variables converted to factors.
read_spss_data()
which reads SPSS data file and converts labelled variables into factors.
# Read Stata data file without converting variable labels as column names file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr") data <- read_stata_data(file = file_path) data # Read Stata data file and convert variable labels as column names data <- read_stata_data(file = file_path, label = TRUE) data
# Read Stata data file without converting variable labels as column names file_path <- system.file("extdata", "Wages.dta", package = "bulkreadr") data <- read_stata_data(file = file_path) data # Read Stata data file and convert variable labels as column names data <- read_stata_data(file = file_path, label = TRUE) data