{smcl}
{* 18feb2011}{...}
{hi:help collect}
{hline}

{title:Title}

{cmd:collect} {hline 2} Concatenate multiple files

{title:Syntax}

{pmore}{cmdab:collect} {it:epath} [{cmd:;} {it:epath} {cmd:...}] [{cmd:,} {opt k:eep(evarlist)} {opt t:est} {opt nom:ore} {opt g:enerate(newvar)} {opt app:end}]

{pstd}Where each {it:epath} is a file path that can include wildcards and/or regular expressions. To use a regular expression, a segment of the path must be of the form:

{pmore}{cmd:regex(}{it:regular_expression}{cmd:)}

{pmore}For example,

{pmore}{cmd:directory/subdir/regex([a-z]+[0-9]+)/just this file.dta}

{pmore}would select only files called {cmd:just this file.dta} from any directory in {cmd:subdir/} whose name is {it:letters} followed by {it:numbers}.

{pstd}And where {it:evarlist} is an extended {help elemlist:{it:elemlist}} that allows for renaming variables. Any portion(s) of the entire {help elemlist:{it:elemlist}} can be written as:

{pmore}{cmd:(}{it:portion}{cmd:)->}{varname}

{pstd}where {it:portion} may be any {help elemlist:{it:elemlist}}, including ranges. If {it:portion} resolves to more than one variable in a single dataset, an error results.

{title:Description}

{pstd}{cmd:collect} concatenates multiple files into a single dataset in memory. If StatTransfer is available, any of the files may be in non-Stata formats. Variables can be combined and renamed.

{pstd}Before collecting any data, {cmd:collect} gives a fairly detailed report of what it is about to do. It shows:

{phang2}o-{space 2}The paths and id numbers of the data files it will use

{pmore}and, for each data file:

{phang2}o-{space 2}Patterns that do not match any variables

{phang2}o-{space 2}The number of variables that will be kept{break} (This number is displayed as a link that will display all the variables in the dataset, highlighting the ones to be kept.)

{phang2}o-{space 2}Any renaming errors{break} (If there are errors, all of them will be reported before the command aborts.)
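
{pstd}As an illustration of producing this report without collecting anything, here is a minimal sketch using the {opt t:est} option from the syntax diagram. The path and the variable names ({cmd:id}, {cmd:dob}, {cmd:birthdate}) are hypothetical, and the sketch assumes that a space-separated list of names is an acceptable {it:portion}:

{pstd}{cmd:. collect a:/visits/site*.dta, test keep(id (dob birthdate)->dob)}

{pstd}Under those assumptions, the report would list each matching file, show how many of the requested variables each file contributes, and flag as a renaming error any file in which both {cmd:dob} and {cmd:birthdate} are present.
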
{pstd}{cmd:collect} will also report an estimate of the memory required for the entire collection, and if that exceeds Stata's current memory setting, it will attempt to increase memory before collecting the data. When all the files to be collected are Stata data files, the estimate should be accurate, but when some files are in non-Stata formats, it could be significantly off. See Remarks, below.

{title:Options}

{phang}{opt k:eep(evarlist)} specifies the variables to be kept from any of the files collected. The variables in {opt keep(evarlist)} do not need to be present in every (or indeed any) file. If they are present in any of the collected files, they will be kept in the final data file.

{phang}{opt t:est} causes {cmd:collect} to report on what it would do (i.e., which files it would use, which variables were used or not found, any errors, and the memory required), without actually collecting the data.

{phang}{opt g:enerate(newvar)} adds a variable to the final data file identifying the file each observation came from. The file names and ids are included in {cmd:collect}'s report, and {newvar} is labeled with the file names.

{phang}{opt nom:ore} prevents {cmd:collect} from increasing Stata's memory before collecting the data. When the files to be collected include non-Stata format files, the estimated memory required can be significantly off. If you believe you have a better estimate than {cmd:collect}, use {opt nom:ore}.

{phang}{opt app:end} causes {cmd:collect} to begin with the data in memory, and then append the data collected from the specified files.

{title:Remarks}

{title:Memory}

{pstd}My choice among many less-than-ideal possibilities was to have {cmd:collect} estimate the number of observations in non-Stata data files based on the physical size of the file.

{pstd}For SAS files, this is likely to be a significant overestimate. In general, that should be OK: memory will be increased enough, and the data will be collected.
With large enough data files, though, and with overestimates of perhaps 50%, automatically requesting all that memory could be an irritation. In that case, the {opt nom:ore} option may help.

{pstd}For spreadsheets, the estimate is likely to be too low. However, data from a spreadsheet is unlikely to push the boundaries of ordinary memory settings, so that should be OK too, I think.

{title:Returning file paths}

{pstd}The older version of {cmd:collect} had an option to return the list of file paths in {cmd:r(files)}, without collecting the data. That function has been moved, with some improvements, into the new command {help collectpaths}.

{title:Examples}

{pstd}Using unrelated semicolon-separated paths:

{pstd}{cmd:. collect a:/one/path.dta; b:/another/path.dta}

{pstd}Using wildcards to select multiple directories and files:

{pstd}{cmd:. collect a:/multiple*/dirs*/these*.dta}

{pstd}Using wildcards and regular expressions to select multiple directories:

{pstd}{cmd:. collect a:/multiple*/regex([^x]+)/this.dta}

{title:Also see}

{pstd}Online: {help collectpaths}

{pstd}Contact: {browse "mailto:elliott.lowy@va.gov":Elliott Lowy}