Ensured Sanitized Column Names are Unique in AutoML CLI#5177
Ensured Sanitized Column Names are Unique in AutoML CLI#5177mstfbl merged 6 commits intodotnet:masterfrom
Conversation
| /// <summary> | ||
| /// Utilize <see cref="ML.CodeGenerator.Utilities.Utils.GenerateClassLabels(ColumnInferenceResults, IDictionary{string, CodeGeneratorSettings.ColumnMapping})"/> | ||
| /// </summary> | ||
| internal IList<string> GenerateClassLabels(IDictionary<string, CodeGeneratorSettings.ColumnMapping> columnMapping = default) |
There was a problem hiding this comment.
I've decided to keep this GenerateClassLabels as it is possible that a CodeGenerator instance can call this function, as it does below:
| Name = "ModelInput.cs", | ||
| }; | ||
| NamerFactory.AdditionalInformation = "null_map"; | ||
| NamerFactory.AdditionalInformation = info + "_null_map"; |
There was a problem hiding this comment.
I made this change to produce and check for 4 "approved" files within 1 unit test, instead of making a new unit test consisting of mostly duplicate code from this test.
| Name = "ModelInput.cs", | ||
| }; | ||
| NamerFactory.AdditionalInformation = "map"; | ||
| NamerFactory.AdditionalInformation = info + "_map"; |
There was a problem hiding this comment.
Same logic as above.
|
|
||
|
|
||
| } | ||
| public class CodeGenTestData |
There was a problem hiding this comment.
This is the data being fed to the unit test ClassLabelGenerationTest. I went this way as it is more efficient to iterate through test data in one ClassLabelGenerationTest unit test, than to have multiple unit tests that consist of duplicate code.
Codecov Report
@@ Coverage Diff @@
## master #5177 +/- ##
==========================================
- Coverage 73.54% 73.53% -0.01%
==========================================
Files 1010 1014 +4
Lines 188029 188879 +850
Branches 20266 20329 +63
==========================================
+ Hits 138288 138901 +613
- Misses 44272 44472 +200
- Partials 5469 5506 +37
|
| /// <summary> | ||
| /// Utilize <see cref="ML.CodeGenerator.Utilities.Utils.GenerateClassLabels(ColumnInferenceResults, IDictionary{string, CodeGeneratorSettings.ColumnMapping})"/> | ||
| /// </summary> | ||
| internal IList<string> GenerateClassLabels(IDictionary<string, CodeGeneratorSettings.ColumnMapping> columnMapping = default) |
| result.Add(sb.ToString()); | ||
| result.Add("\r\n"); | ||
| } | ||
| for (int i = 1; i < result.Count; i+=3) |
There was a problem hiding this comment.
Could you make sure the generated label are identical for other source files in generated project?
For example, here in Program.tt
https://github.com/mstfbl/machinelearning/blob/42dfa105fddf04ab23d0e574580edf6f99a955a6/src/Microsoft.ML.CodeGenerator/Templates/Console/PredictProgram.tt#L47
and here: (SampleData also call into Util.Normalize)
https://github.com/mstfbl/machinelearning/blob/42dfa105fddf04ab23d0e574580edf6f99a955a6/src/Microsoft.ML.CodeGenerator/Templates/Console/PredictProgram.tt#L27
There was a problem hiding this comment.
Currently, these generated labels are generated using Utils.Normalize in GenerateSampleData, and hence are not added the _col_x suffix for conflicting column names. I am making an issue to track this change. Thanks!
Fix #3902. This PR ensures that column names that are pulled from datasets and used to generate C# model input are unique. As a result, naming collisions are prevented and are valid C# variables. This is done by adding the column naming-collision fix in
ML.CodeGenerator.Utilities.Utils.GenerateClassLabels.This naming-collision fix consists of appending the type of the variable to the name of the second duplicate. For example, let us have 3 columns in a sample dataset:
This is represented with
TextLoader.Columns[]as:For
string countryto be a valid class variable, its first letter has to be capitalized, which would result in a naming collision. This PR ensures that this doesn't happen, and the generated code becomes:In addition, this PR ensures non-unique column names are properly named and differentiated, as shown below:
This PR also adds unit tests to test naming collision scenarios in both
CodeGenTestsandConsoleCodeGeneratorTests.In addition, this PR also cleans up duplicate code with respect to the
ML.CodeGenerator.Utilities.Utils.GenerateClassLabelsfunction, by linking its duplicate functionML.CodeGenerator.CodeGenerator.GenerateClassLabelsto the former.