Processing Fixed Width and Complex Files
Pointers
The first decision you will have to make is whether the data is structured at all. If it is a known type like CSV, JSON, Avro, XML or Parquet, just use a Record Reader with the record-oriented processors.
If it's semi-structured, like a log file, GrokReader or ExtractGrok may work.
If it's CSV-like, you may be able to tweak the CSVReader to handle it (say, header or no header) or switch between the two CSV parsers NiFi provides (Jackson or Apache Commons).
If it's a format like PDF, Word, Excel or RTF, I have a custom processor that uses Apache Tika to parse it into text. Once it is text you can probably work with it.
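If none of the built-in readers fit, a common fallback is to hand the content to a script with ExecuteStreamCommand (or ExecuteScript). Below is a minimal sketch, not a definitive implementation: a Python script that reads fixed-width lines from stdin and writes one JSON object per line to stdout. The field names and column offsets are invented for illustration and would come from your file's layout spec.

```python
# fixed_width_to_json.py - hypothetical example for ExecuteStreamCommand.
# The field names and (start, end) character offsets are made up; replace
# them with the real layout of your fixed-width file.
import json
import sys

LAYOUT = [
    ("account_id", 0, 10),
    ("name", 10, 40),
    ("balance", 40, 52),
    ("open_date", 52, 60),
]

def parse_line(line):
    """Slice one fixed-width line into a dict, trimming padding spaces."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}

if __name__ == "__main__":
    # ExecuteStreamCommand pipes the FlowFile content to stdin and
    # captures stdout as the new FlowFile content.
    for raw in sys.stdin:
        line = raw.rstrip("\n")
        if line.strip():
            sys.stdout.write(json.dumps(parse_line(line)) + "\n")
```

The JSON-per-line output can then go through a JsonTreeReader with ConvertRecord, QueryRecord and friends. ReplaceText with regex capture groups is another common option for simple layouts (see the Examples links below).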
Examples
- https://community.cloudera.com/t5/Support-Questions/How-to-parse-w-fixed-width-instead-of-char-delimited/td-p/102597
- https://community.cloudera.com/t5/Support-Questions/Best-way-to-parse-Fixed-width-file-using-Nifi-Kindly-help/m-p/177637
- https://community.cloudera.com/t5/Support-Questions/Split-one-Nifi-flow-file-into-Multiple-flow-file-based-on/td-p/203387
- https://community.cloudera.com/t5/Support-Questions/Splitting-a-Nifi-flowfile-into-multiple-flowfiles/td-p/139930
- https://community.cloudera.com/t5/Support-Questions/How-to-ExtractText-from-flow-file-using-Nifi-Processor/td-p/190826
- https://community.cloudera.com/t5/Community-Articles/Running-SQL-on-FlowFiles-using-QueryRecord-Processor-Apache/ta-p/246671
- https://www.datainmotion.dev/2020/12/smart-stocks-with-flank-nifi-kafka.html
- https://www.datainmotion.dev/2021/01/flank-real-time-transit-information-for.html
- https://medium.com/@nlabadie/apache-nifi-netflow-to-syslog-117d46867ae1
- https://medium.com/@nlabadie/apache-nifi-sftp-csv-to-syslog-json-d9da6938defa
- https://medium.com/@nlabadie/apache-nifi-pulling-from-mysql-and-sending-to-syslog-181dd4ae969c
- https://stackoverflow.com/questions/59291548/how-to-use-nifi-extractgrok-properly
Documentation
Processors To Use For File Manipulation
- AttributesToCSV
- AttributesToJSON
- ConvertExcelToCSVProcessor
- ConvertRecord
- ConvertText
- CSVReader
- EvaluateJsonPath
- EvaluateXPath
- EvaluateXQuery
- ExecuteScript
- ExecuteStreamCommand
- ExtractGrok
- ExtractText
- FlattenJson
- ForkRecord
- GrokReader
- JsonPathReader
- JsonTreeReader
- JoltTransformJSON
- JoltTransformRecord
- LookupAttribute
- LookupRecord
- MergeContent
- MergeRecord
- ModifyBytes
- ParseSyslog*
- PartitionRecord
- QueryRecord
- ReaderLookup
- ReplaceText
- ReplaceTextWithMapping
- ScriptedReader
- ScriptedRecordSink
- ScriptedTransformRecord
- SegmentContent
- SplitContent
- SplitJson
- SplitRecord
- SplitText
- SplitXml
- SyslogReader
- TransformXml
- UnpackContent
- UpdateAttribute
- UpdateRecord
- ValidateCsv
- ValidateRecord
- ValidateXml
Custom Processors
- https://community.cloudera.com/t5/Community-Articles/Parsing-Any-Document-with-Apache-NiFi-1-5-with-Apache-Tika/ta-p/247672
- https://community.cloudera.com/t5/Community-Articles/Creating-HTML-from-PDF-Excel-and-Word-Documents-using-Apache/ta-p/247968
- https://github.com/tspannhw/nifi-extracttext-processor
Helper Projects, SDKs, Libraries and Services
- https://tika.apache.org/ - Apache Tika can be integrated as a custom processor or called via REST and run as a separate server/service (see the Tika Server sketch after this list).
- Cloudera Machine Learning - call this service via REST and have a machine learning model do the parsing: https://blog.cloudera.com/integrating-machine-learning-models-into-your-big-data-pipelines-in-real-time-with-no-coding/
- REST Service - there may be a service you can run locally or in the cloud that can parse the format, and NiFi can call it.
- Python - use ExecuteStreamCommand to have Python, a shell script, or another OS executable do the parsing (the fixed-width sketch above is one example).
- Spark - try custom Spark with Java, Python or Scala.
- Flink - try custom Flink with Java.
- XSLT
- XPath
- XQuery
- JsonPath
- Json
- https://github.com/AbsaOSS/cobrix - Cobrix is a COBOL copybook / mainframe data source for Spark (see the PySpark sketch after this list).
- https://github.com/tspannhw/EverythingApacheNiFi
- You may need to use a cache: https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html
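As an illustration of the Apache Tika item above, here is a minimal sketch assuming a Tika Server is already running on its default port 9998: it PUTs the raw document bytes to the /tika endpoint and gets plain text back. NiFi's InvokeHTTP can make the same call with no code, or a script like this can be wired in via ExecuteStreamCommand.

```python
# tika_extract.py - hypothetical helper; assumes a Tika Server is
# reachable at http://localhost:9998 (the default port).
import sys

import requests

def extract_text(path, tika_url="http://localhost:9998/tika"):
    """PUT the raw document bytes to Tika Server and return plain text."""
    with open(path, "rb") as f:
        resp = requests.put(tika_url, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(extract_text(sys.argv[1]))
```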
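For mainframe-style fixed-width data described by a COBOL copybook, the Cobrix project linked above provides a Spark data source. A rough PySpark sketch, assuming the Cobrix spark-cobol package is on the Spark classpath and using placeholder paths:

```python
# cobrix_read.py - sketch only; the paths are placeholders and the
# za.co.absa.cobrix spark-cobol package must be added via --packages/--jars.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("copybook-demo").getOrCreate()

# Cobrix registers the "cobol" data source; the copybook describes the
# fixed-width record layout of the (often EBCDIC) data files.
df = (
    spark.read.format("cobol")
    .option("copybook", "/path/to/record_layout.cpy")
    .load("/path/to/mainframe_data")
)

df.show(10, truncate=False)
```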