Sync Files with FilesFrom* Commands

Overview of FilesFrom* commands

This article describes the ability in a flow pipeline to automatically detect new or modified files in a remote file repository and process them.

To do this, use one of the new “FilesFrom” commands in a flow pipeline:

  • FilesFromFtp: synchronize from a remote FTP or SFTP folder.
  • FilesFromZip: not necessarily a remote repository, but behaves the same way

Both FilesFrom commands follow a similar logic. A flow pipeline connects to a set of files (ie a remote file repository or zip file) and then, for each new or modified file, passes that file fo a normal pipeline, which loads the data to the target table(s). One batch is created per file generated by the FilesFrom* command, unless the file was previously processed and has not changed since the last load. The system internally keeps track of the remote modified timestamp and size of the remote file, and uses that information the next time it runs to determine if it’s changed or not.

By scheduling such a flow pipeline you can effectively “synchronize” with a remote folder, even if the remote folder has hundreds or thousands of files.

The syncronization has two modes, AUTO (default) to only process new or updated files and ALL to process all files from the source, regardless of whether they have been loaded before or not.

FilesFromFTP command

Here is an example of the FilesFromFtp command. Below you may see the flow pipeline which creates a connection with a remote SFTP file repository and then, for each file matching the FileNamePattern, launches the LOAD_HAQ_ZIP origin. In this example only new or files modified since the last run will be processed because the Sync attribute is not specified and defaulting to “Auto”.

<XmartPipeline>
  <!-- Flow mode: PARALLEL (default) or SEQUENTIAL -->
  <Flow Mode="PARALLEL">
    <FilesFromFtp Url="sftp://eusend-sftp.eusfx.ec.europa.eu:2222/toUser/"
                  FileNamePattern="^HCSHA_2011NAT_A_BE_20\d\d_.+\.zip$"
                  ConnectionName="eDamis">
      <RunPipeline OriginCode="LOAD_HAQ_ZIP" />
    </FilesFromFtp>
  </Flow>
</XmartPipeline>

And here is the section of the normal pipeline with origin LOAD_HAQ_ZIP. Since we are expecting to retrieve zip files from the ftp server, the GetZip command is being used to process the files. Since the zip contains a single excel file, the GetExcel command is being used to process the excel file:

<XmartPipeline>
  <Extract>
  <GetZip FileNameToExtractPattern="xlsm" Origins="LOAD_HAQ_ZIP" >
    <GetExcel TableName="*" FindStartingRow="SHA"
              RemoveBlankRows="false" RemoveBlankCols="false"
              TableTypes="Worksheets" TrimValues="true" RepeatMergedCellsValue="true" 
              StrictTypes="true" />
  </GetZip>   
  </Extract>
  <Transform>
    <MergeTables TableNames="*" SkipTableNames="General,SourceSystem" MergedTableName="data"
                 KeepTables="false" />
    <CopyTable TableName="data" Columns="DCSourceTable" OutputTableName="CatCheck" />
    <SetContext ActiveTables="CatCheck" />
    <CleanTable RemoveBlankRows="true" RemoveBlankColumns="true" TrimValues="true" />
    <RemoveDuplicates />
  <!-- In the actual pipeline more transformation is taking place which is removed for this example-->   
  </Transform>
  <Load>
    <LoadTable SourceTable="raw_dms_survey" TargetTable="FACT_DMS_SURVEY" LoadStrategy="MERGE" >
      <Transform>
   <!-- In the actual pipeline more transformation is taking place which is removed for this example-->
        <CleanTable RemoveBlankColumns="false" RemoveBlankRows="true" TrimValues="true" />
	  </Transform>
      <ColumnMappings Auto="true" />
    </LoadTable>  
  </Load>
</XmartPipeline>

In this case, 20 files were located when the flow pipeline was executed, so as expected a pipeline chain was created:

TODO UPDATE THIS IMAGE WITH NEW UI

pipeline chain files from ftp

For the full details of this command, please see the pipeline script guide for the FilesFromFTP command.

FilesFromZip

Below is an example of the FilesFromZip command used in a flow pipeline. FilesFromZip will generate one batch for each file in the zip which matches the FileNameToExtractPattern, by calling the pipeline with origin PIPE_ZIP. The zip file is treated like a remote file repository; the system will remember and compare the file timestamp and size of each processed file and skip processing next time if the file has not changed.

<XmartPipeline>
  <Flow Mode="PARALLEL">
    <FilesFromZip FileNameToExtractPattern="^country_demo\d+\.xlsx"> 
      <RunPipeline OriginCode="PIPE_ZIP"/>
    </FilesFromZip>
  </Flow>
</XmartPipeline>

And here is the section of the normal pipeline with origin PIPE_ZIP in this case. It will receive a single excel file as extracted by the FilesFromZip in the flow pipeline.

<XmartPipeline>
  <Extract>
      <GetExcel TableName="*"   OutputTableName="data" StartingRow="1"/>
  </Extract>
  <Load>
    <LoadTable SourceTable="data" TargetTable="FACT_COUNTRY_DEMO" LoadStrategy="MERGE">
      <ColumnMappings Auto="true"/>
    </LoadTable>
  </Load>
</XmartPipeline>

In this case, 2 files were found in the zip when the flow pipeline was executed:

pipeline chain files from zip

For the full details of this command, please see the pipeline script guide for the FilesFromZIP

Sync=”ALL” option

The syncronization has two modes, AUTO (default) to only process new or updated files and ALL to process all files from the source, regardless of whether they have been loaded before or not. Use ALL when it is necessary to reprocess all files. To minimize load on the system, please set it back to AUTO, to prevent re-processing unchanged files.