| Title: | Apply Functions to Blocks of Files |
|---|---|
| Description: | Read and process a large delimited file block by block. A block consists of all the contiguous rows that have the same value in the first field. The result can be returned as a list or a data.table, or even printed directly to an output file. |
| Authors: | Federico Marotta [aut, cre] |
| Maintainer: | Federico Marotta <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2025-01-27 03:55:03 UTC |
| Source: | https://github.com/fmarotta/fplyr |
This function is useful to quickly glance at a big chunked file. It is similar to the `head()` function, except that it does not read the first few lines, but rather the first few blocks of the file. By default, only the first block is read; it is not advisable to read a large number of blocks in this way because they may occupy a lot of memory. The blocks are returned as a `data.table`. See `?fplyr` for the definitions of chunked file and block.
```r
fdply(
  input,
  nblocks = 1,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  colClasses = NULL,
  header = TRUE,
  stringsAsFactors = FALSE,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)
```
| Argument | Description |
|---|---|
| input | Path of the input file. |
| nblocks | The number of blocks to read. |
| key.sep | The character that delimits the first field from the rest. |
| sep | The field delimiter (often equal to key.sep). |
| skip | Number of lines to skip at the beginning of the file. |
| colClasses | Vector or list specifying the class of each field. |
| header | Whether the file has a header. |
| stringsAsFactors | Whether to convert strings into factors. |
| select | The columns (names or numbers) to be read. |
| drop | The columns (names or numbers) not to be read. |
| col.names | Names of the columns. |
| parallel | Number of cores to use. |
A `data.table` containing the file truncated to the number of blocks specified.
fdply: from file to data.table
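A minimal usage sketch, following the arguments shown above and the bundled dt_iris.csv example file used throughout these pages; the exact output depends on the file's blocks:

```r
f <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Read only the first block (the default)
fdply(f)

# Read the first two blocks
fdply(f, nblocks = 2)
```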
Suppose you want to process each block of a file and the result is again a `data.table` that you want to print to some output file. One possible approach is to use `l <- flply(...)` followed by `do.call(rbind, l)` and `fwrite()`, but this would be slow. `ffply()` offers a faster solution to this problem.
```r
ffply(
  input,
  output = "",
  FUN,
  ...,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  header = TRUE,
  nblocks = Inf,
  stringsAsFactors = FALSE,
  colClasses = NULL,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)
```
| Argument | Description |
|---|---|
| input | Path of the input file. |
| output | String containing the path to the output file. |
| FUN | Function to be applied to each block. It must take at least two arguments: the first is a `data.table` containing the block (without the first field), the second is the value of the first field for that block. |
| ... | Additional arguments to be passed to FUN. |
| key.sep | The character that delimits the first field from the rest. |
| sep | The field delimiter (often equal to key.sep). |
| skip | Number of lines to skip at the beginning of the file. |
| header | Whether the file has a header. |
| nblocks | The number of blocks to read. |
| stringsAsFactors | Whether to convert strings into factors. |
| colClasses | Vector or list specifying the class of each field. |
| select | The columns (names or numbers) to be read. |
| drop | The columns (names or numbers) not to be read. |
| col.names | Names of the columns. |
| parallel | Number of cores to use. |
Returns NULL invisibly. As a side effect, writes the processed `data.table` to the output file.
ffply: from file to file
```r
f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")
f2 <- tempfile()

# Copy the first two blocks from f1 into f2 to obtain a shorter but
# consistent version of the original input file.
ffply(f1, f2, function(d, by) {return(d)}, nblocks = 2)

# Compute the mean of the columns for each species
ffply(f1, f2, function(d, by) d[, lapply(.SD, mean)])

# Reshape the file, block by block
ffply(f1, f2, function(d, by) {
    val <- do.call(c, d)
    var <- rep(names(d), each = nrow(d))
    data.table(Var = var, Val = val)
})
```
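As a rough sketch of the slower route mentioned in the description above (collect with `flply()`, bind, then `fwrite()`), compared with the streaming `ffply()` call. This assumes fplyr and data.table are attached; the processing step is illustrative:

```r
f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")
f2 <- tempfile()

# Slower route: process every block, keep all results in memory,
# bind them together, then write the whole table at once
l <- flply(f1, function(d) d[, !"Species"][, lapply(.SD, mean)])
fwrite(do.call(rbind, l), f2, sep = "\t")

# ffply() performs the same kind of block-wise processing,
# but writes each processed block to the output file as it goes
ffply(f1, f2, function(d, by) d[, lapply(.SD, mean)])
```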
With `flply()` you can apply a function to each block of the file separately. The result of each function call is saved into a list and returned. `flply()` is similar to `lapply()`, except that it applies the function to each block of the file rather than to each element of a list. It is also similar to `by()`, except that it does not read the whole file into memory; each block is processed as soon as it is read from the disk.
```r
flply(
  input,
  FUN,
  ...,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  header = TRUE,
  nblocks = Inf,
  stringsAsFactors = FALSE,
  colClasses = NULL,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)
```
| Argument | Description |
|---|---|
| input | Path of the input file. |
| FUN | A function to be applied to each block. The first argument to the function must be a `data.table` containing one block of the file. |
| ... | Additional arguments to be passed to FUN. |
| key.sep | The character that delimits the first field from the rest. |
| sep | The field delimiter (often equal to key.sep). |
| skip | Number of lines to skip at the beginning of the file. |
| header | Whether the file has a header. |
| nblocks | The number of blocks to read. |
| stringsAsFactors | Whether to convert strings into factors. |
| colClasses | Vector or list specifying the class of each field. |
| select | The columns (names or numbers) to be read. |
| drop | The columns (names or numbers) not to be read. |
| col.names | Names of the columns. |
| parallel | Number of cores to use. |
Returns a list containing, for each block, the result of the processing.
flply: from file to list
```r
f <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Compute, within each block, the correlation between Sepal.Length and Petal.Length
flply(f, function(d) cor(d$Sepal.Length, d$Petal.Length))

# Summarise each block
flply(f, summary)

# Make a different linear model for each block
block.lm <- function(d) {
    lm(Sepal.Length ~ ., data = d[, !"Species"])
}
lm.list <- flply(f, block.lm)
```
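A small sketch of forwarding extra arguments to FUN through `...`, and of the `parallel` argument; the correlation method and the core count are purely illustrative choices:

```r
f <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Named arguments after FUN are forwarded to it through `...`
spearman.cors <- flply(f, function(d, method) {
    cor(d$Sepal.Length, d$Petal.Length, method = method)
}, method = "spearman")

# Process the blocks on two cores (illustrative value)
summaries <- flply(f, summary, parallel = 2)
```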
Sometimes a file should be processed in many different ways. `fmply()` applies a function to each block of the file; the function should return a list of m `data.table`s, each of which is written to a different output file. Optionally, the function can return a list of m + 1 elements, where the first m elements are `data.table`s and are written to the output files, while the last element is returned as in `flply()`.
```r
fmply(
  input,
  outputs,
  FUN,
  ...,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  header = TRUE,
  nblocks = Inf,
  stringsAsFactors = FALSE,
  colClasses = NULL,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)
```
| Argument | Description |
|---|---|
| input | Path of the input file. |
| outputs | Vector of m paths for the output files. |
| FUN | A function to apply to each block. It takes as input a `data.table` containing one block of the file and should return a list of m `data.table`s, one for each output file (optionally m + 1; see the Value section below). |
| ... | Additional arguments to be passed to FUN. |
| key.sep | The character that delimits the first field from the rest. |
| sep | The field delimiter (often equal to key.sep). |
| skip | Number of lines to skip at the beginning of the file. |
| header | Whether the file has a header. |
| nblocks | The number of blocks to read. |
| stringsAsFactors | Whether to convert strings into factors. |
| colClasses | Vector or list specifying the class of each field. |
| select | The columns (names or numbers) to be read. |
| drop | The columns (names or numbers) not to be read. |
| col.names | Names of the columns. |
| parallel | Number of cores to use. |
If `FUN` returns m elements, `fmply()` returns NULL invisibly. If `FUN` returns m + 1 elements, `fmply()` returns the list of all the last elements. As a side effect, it writes the first m outputs of `FUN` to the `outputs` files.
fmply: from file to multiple files
```r
fin <- system.file("extdata", "dt_iris.csv", package = "fplyr")
fout1 <- tempfile()
fout2 <- ""

# Copy the input file to a tempfile as it is, and, at the same time, print
# a summary to the console
fmply(fin, c(fout1, fout2), function(d) {
    list(d, data.table(unclass(summary(d))))
})

fout3 <- tempfile()
fout4 <- tempfile()

# Use linear and polynomial regression and print the outputs to two files
fmply(fin, c(fout3, fout4), function(d) {
    lr.fit <- lm(Sepal.Length ~ ., data = d[, !"Species"])
    lr.summ <- data.table(Species = d$Species[1], t(coefficients(lr.fit)))
    pr.fit <- lm(Sepal.Length ~ poly(as.matrix(d[, 3:5]), degree = 3),
                 data = d[, !"Species"])
    pr.summ <- data.table(Species = d$Species[1], t(coefficients(pr.fit)))
    list(lr.summ, pr.summ)
})
```
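A sketch of the m + 1 form described above, with a single output file (m = 1): the first element of the returned list is written to the file, while the second (here, a fitted model) is collected and returned as in `flply()`. Column names follow the dt_iris.csv example; the processing itself is illustrative:

```r
fin <- system.file("extdata", "dt_iris.csv", package = "fplyr")
fout <- tempfile()

# The first element (block-wise column means) goes to `fout`;
# the fitted models are collected in the list returned by fmply()
fits <- fmply(fin, fout, function(d) {
    means <- data.table(Species = d$Species[1],
                        d[, !"Species"][, lapply(.SD, mean)])
    fit <- lm(Sepal.Length ~ ., data = d[, !"Species"])
    list(means, fit)
})
```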
`ftply()` takes as input the path to a file and a function, and returns a `data.table`. It is a faster equivalent to using `l <- flply(...)` followed by `do.call(rbind, l)`.
```r
ftply(
  input,
  FUN = function(d, by) d,
  ...,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  header = TRUE,
  nblocks = Inf,
  stringsAsFactors = FALSE,
  colClasses = NULL,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)
```
| Argument | Description |
|---|---|
| input | Path of the input file. |
| FUN | Function to be applied to each block. It must take at least two arguments: the first is a `data.table` containing the block (without the first field), the second is the value of the first field for that block. |
| ... | Additional arguments to be passed to FUN. |
| key.sep | The character that delimits the first field from the rest. |
| sep | The field delimiter (often equal to key.sep). |
| skip | Number of lines to skip at the beginning of the file. |
| header | Whether the file has a header. |
| nblocks | The number of blocks to read. |
| stringsAsFactors | Whether to convert strings into factors. |
| colClasses | Vector or list specifying the class of each field. |
| select | The columns (names or numbers) to be read. |
| drop | The columns (names or numbers) not to be read. |
| col.names | Names of the columns. |
| parallel | Number of cores to use. |
`ftply()` is similar to `ffply()`, but while the latter writes the result of the processing to disk after each block, the former keeps the results in memory until the whole file has been processed, and then returns the complete `data.table`.
Returns a `data.table` with the results of the processing.
ftply: from file to data.table
```r
f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Compute the mean of the columns for each species
ftply(f1, function(d, by) d[, lapply(.SD, mean)])

# Read only the first two blocks
ftply(f1, nblocks = 2)
```
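A further sketch using the second argument, `by`, which holds the value of the first field for the block being processed; the message is purely illustrative:

```r
f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# `by` contains the key of the current block
ftply(f1, function(d, by) {
    message("Processing block: ", by)
    d[, lapply(.SD, mean)]
})
```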